Pokemon TCG Card Extractor with OCR

#Introduction

As a Pokemon TCG Pocket player, I found myself manually searching through my collection every time I wanted to check if I had a specific card. The game lets you capture cards, but there's no easy way to export or search your collection. So I built one.

This article documents how I built a complete card extraction system using OCR, web scraping, and SQLite. I'll walk through the architecture, the challenges I faced, and how I solved them.


#The Problem

Manually cataloging cards is tedious:

  • Screenshot a card in the app
  • Look up the card name in a database
  • Record it in a spreadsheet

I wanted to automate this: screenshot → OCR → database in seconds, not minutes.


#System Architecture

The system has five main components, summarized in the table below.

#Component Breakdown

| Component | Purpose |
|-----------|---------|
| preprocessing/ | Image cropping, contrast enhancement |
| extraction/ | Detect Pokemon/Trainer/Energy cards |
| ocr_engine/ | EasyOCR + Tesseract for text extraction |
| api/local_lookup.py | Multi-signal card matching |
| database.py | SQLite collection storage |


#Data Collection: Scraping Pokewiki.de

Before I could match cards, I needed a database. I scraped pokewiki.de (German Pokemon wiki) for card data.

#What I Scraped

  • 2540 unique cards across 17 sets (A1-B2a, PROMO-A, PROMO-B)
  • 124 unique abilities with effect descriptions
  • 4509 image URLs (including reprints)
  • ~200 attack effects with detailed text

#Scraping Data Flow

#Card Detection: Pokemon vs Trainer vs Energy

Not all cards are equal. Pokemon cards have HP, attacks, and abilities. Trainer cards have different fields entirely. I needed to detect the card type first.

The detection uses German keywords, since my game client displays cards in German.
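Keyword-based detection along these lines can be sketched as follows. The keyword lists and the function name `detect_card_type` are assumptions for illustration, not the ones used in the project; "KP" is the German label for HP.

```python
import re

# Hypothetical keyword lists; the real project may use different ones.
TRAINER_KEYWORDS = ("trainer", "unterstützer", "item")
ENERGY_KEYWORDS = ("energie",)
# "KP" next to a number suggests a Pokemon card (KP is the German HP label).
HP_PATTERN = re.compile(r"\bKP\s*\d+|\d+\s*KP\b", re.IGNORECASE)


def detect_card_type(ocr_text: str) -> str:
    """Classify OCR text as 'pokemon', 'trainer', 'energy', or 'unknown'."""
    text = ocr_text.casefold()
    # Check HP first: attack texts on Pokemon cards also mention "Energie".
    if HP_PATTERN.search(ocr_text):
        return "pokemon"
    if any(word in text for word in TRAINER_KEYWORDS):
        return "trainer"
    if any(word in text for word in ENERGY_KEYWORDS):
        return "energy"
    return "unknown"
```

The order of the checks matters in this sketch: the most distinctive signal (an HP value) is tested before the keywords that can also appear in effect text.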


#OCR Extraction: EasyOCR to the Rescue

With the card type known, I extracted text using EasyOCR with German and English models.
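A minimal sketch of that step is shown here. `easyocr.Reader(["de", "en"])` and `readtext()` are the library's documented entry points; the helper names, the confidence cutoff of 0.3, and the file name are assumptions made for this example.

```python
from typing import Any, List


def build_reader() -> Any:
    """Create an EasyOCR reader with the German and English models."""
    import easyocr  # imported lazily so extract_text() can be tested without it

    return easyocr.Reader(["de", "en"])


def extract_text(reader: Any, image_path: str, min_confidence: float = 0.3) -> List[str]:
    """Run OCR and keep non-empty fragments at or above min_confidence.

    `reader` must behave like easyocr.Reader: readtext() returns a list of
    (bounding_box, text, confidence) tuples.
    """
    results = reader.readtext(image_path)
    return [
        text.strip()
        for _box, text, confidence in results
        if confidence >= min_confidence and text.strip()
    ]


# Usage (hypothetical file name):
#   lines = extract_text(build_reader(), "card.png")
```

Passing the reader in as a parameter keeps the model load (which is slow) out of the per-card path and makes the filter easy to test with a stub.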

#Extraction Pipeline

#Full End-to-End Data Flow

#Image Preprocessing Pipeline

#OCR Signal Correction Pipeline

#Sample Extraction

Input: Card screenshot of Igastarnish (Grass/Bug Pokemon)

Raw OCR Output:

Parsed Signals:


#Challenges & Solutions

#Challenge 1: OCR Misreads HP Values

Problem: EasyOCR frequently misread HP values: it returned "502" for a 50 HP card and "802" for an 80 HP card. The extra trailing digit was noise from the KP icon (KP is the German abbreviation for HP).

Solution: A post-processing regex that strips the spurious trailing digit.
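One way such a correction could look is sketched below. It rests on two assumptions that are not stated in the project: HP values are always multiples of 10, and they fall between 10 and 400. The function name `clean_hp` is hypothetical.

```python
import re
from typing import Optional

# A multiple of 10 followed by exactly one stray non-zero digit, e.g. "502".
TRAILING_NOISE = re.compile(r"^(\d*0)[1-9]$")


def clean_hp(token: str) -> Optional[int]:
    """Return a plausible HP value from an OCR token, or None if there is none."""
    digits = re.sub(r"\D", "", token)          # drop "KP", spaces, punctuation
    digits = TRAILING_NOISE.sub(r"\1", digits)  # "502" -> "50", "1502" -> "150"
    if not digits:
        return None
    value = int(digits)
    # Assumed sanity check: multiples of 10 within an assumed plausible range.
    if value % 10 == 0 and 10 <= value <= 400:
        return value
    return None
```

A stray "0" cannot be detected this way ("500" could be a misread "50"), so the range check rejects such values rather than guessing.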

#Challenge 2: Duplicate Cards in Database

Problem: Some cards appear in multiple sets (reprints). The scraper was creating duplicate entries with different set IDs but the same card name.

Solution: Added deduplication logic that merges entries based on:

  • Exact German name match
  • Same HP value
  • Same Pokédex number
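The merge on those three criteria can be sketched as a dictionary keyed by the tuple. The field names (`name_de`, `hp`, `dex_no`, `set_id`) are assumptions for illustration and may not match the scraper's actual schema.

```python
from typing import Any, Dict, Iterable, List, Tuple

Card = Dict[str, Any]


def deduplicate(cards: Iterable[Card]) -> List[Card]:
    """Merge reprints that share German name, HP, and Pokedex number."""
    merged: Dict[Tuple[Any, Any, Any], Card] = {}
    for card in cards:
        key = (card["name_de"], card["hp"], card["dex_no"])
        if key not in merged:
            entry = dict(card)                       # copy, leave input untouched
            entry["set_ids"] = [entry.pop("set_id")]  # collect every set it appears in
            merged[key] = entry
        elif card["set_id"] not in merged[key]["set_ids"]:
            merged[key]["set_ids"].append(card["set_id"])
    return list(merged.values())
```

Keeping the list of set IDs on the merged entry preserves the reprint information instead of discarding it.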

#Challenge 3: Missing Card Images

Problem: The initial scrape only retrieved 1483 images; 1969 cards had no image URLs.

Solution: I ran scrape_images.py a second time with more aggressive timeout handling and retry logic.
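A retry loop with a timeout and exponential backoff might look like the sketch below. This is not the code from scrape_images.py; the function names, retry count, and delays are assumptions chosen for the example, and only the standard library is used.

```python
import time
from typing import Callable, Optional


def http_get(url: str, timeout: float) -> bytes:
    """Download a URL with a hard timeout (standard library only)."""
    from urllib.request import urlopen

    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retry(
    url: str,
    fetch: Callable[[str, float], bytes] = http_get,
    retries: int = 3,
    timeout: float = 10.0,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Try `fetch` up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return fetch(url, timeout)
        except OSError:  # covers URLError, timeouts, and connection errors
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)
    return None  # caller can log the URL for a later pass
```

Returning `None` instead of raising lets a batch run continue and collect the failed URLs for another pass.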

#Challenge 4: Special Illustration Cards

Problem: Special illustration (SAI) cards have different image URLs on pokewiki - they're hosted on a separate CDN with different URL patterns.

Solution: Detect SAI cards by rarity ("4 Star" or "Special Illustration") and use a different URL template for them.

#Challenge 5: Weakness/Retreat Not Extracted

Problem: The regex for weakness and retreat wasn't matching the OCR output. The weakness symbol (Fire+20) appeared on a separate line.

Solution: I improved the regex patterns and matched them against the full OCR output instead of individual lines.
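The idea of matching across line breaks can be illustrated with the sketch below. The German labels "Schwäche" (weakness) and "Rückzug" (retreat) are the card terms, but the exact OCR output format, the numeric retreat cost, and the function name are assumptions; the real patterns in the project may differ.

```python
import re
from typing import Any, Dict

# \s* also matches newlines, so the value may sit on the line after the label.
# "[aä]" and "[uü]" tolerate OCR dropping the umlaut.
WEAKNESS = re.compile(
    r"Schw[aä]che\s*:?\s*(?P<type>[^\W\d_]+)?\s*\+\s*(?P<amount>\d+)",
    re.IGNORECASE,
)
RETREAT = re.compile(r"R[uü]ckzug\s*:?\s*(?P<cost>\d+)", re.IGNORECASE)


def parse_weakness_retreat(ocr_text: str) -> Dict[str, Any]:
    """Extract weakness and retreat signals from the complete OCR text."""
    signals: Dict[str, Any] = {}
    weakness = WEAKNESS.search(ocr_text)
    if weakness:
        signals["weakness_type"] = weakness.group("type")  # may be None
        signals["weakness_amount"] = int(weakness.group("amount"))
    retreat = RETREAT.search(ocr_text)
    if retreat:
        signals["retreat_cost"] = int(retreat.group("cost"))
    return signals
```

Running the patterns over the joined text is what makes a value on the following line reachable; a per-line loop would never see the label and the value together.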


#Card Matching: The Multi-Signal Engine

With extracted signals and a database, I needed a matching algorithm. I implemented a priority-based approach: strategies are tried in order, and each one carries a fixed confidence score.

#Confidence Scoring

| Strategy | Confidence | When Used |
|----------|------------|-----------|
| Name + Set | 95% | Exact German name + set ID match |
| Name + HP | 85% | Name fuzzy match + HP match |
| HP + Attack + Set | 85% | HP + attack name + set combo |
| HP + Weakness + Set | 80% | HP + weakness + set combo |
| HP only | 60% | Last resort - just HP match |

Cards below 60% confidence go to failed_to_capture/ for manual review.
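The priority order and scores from the table can be expressed as a small strategy list, as in this sketch. It is a simplification rather than the logic in api/local_lookup.py: all comparisons are exact (the table's "Name + HP" strategy uses a fuzzy name match), and the field names and `match_card` are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Signals = Dict[str, Any]
Card = Dict[str, Any]

# (label, confidence, fields that must all be present and equal), in priority order.
STRATEGIES: List[Tuple[str, float, Tuple[str, ...]]] = [
    ("name+set", 0.95, ("name", "set_id")),
    ("name+hp", 0.85, ("name", "hp")),
    ("hp+attack+set", 0.85, ("hp", "attack", "set_id")),
    ("hp+weakness+set", 0.80, ("hp", "weakness", "set_id")),
    ("hp", 0.60, ("hp",)),
]


def _agrees(signals: Signals, card: Card, fields: Tuple[str, ...]) -> bool:
    """True if every field was extracted and equals the card's value."""
    return all(
        signals.get(field) is not None and signals.get(field) == card.get(field)
        for field in fields
    )


def match_card(signals: Signals, cards: List[Card]) -> Optional[Tuple[Card, float, str]]:
    """Return (card, confidence, strategy) for the first strategy that hits."""
    for label, confidence, fields in STRATEGIES:
        for card in cards:
            if _agrees(signals, card, fields):
                return card, confidence, label
    return None  # no strategy matched: route to manual review
```

Iterating over strategies in the outer loop ensures a weak match on one card never wins over a stronger match on another.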


#Database Design

The collection is stored in SQLite with a simple schema.

Key features:

  • Quantity tracking: Increment when adding duplicates
  • Full card data: All fields stored for filtering
  • Fast lookups: Indexed on name, set_id, hp
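A schema with these properties could look like the following sketch. The table and column names are assumptions (the real database.py may store more fields); the indexes on name, set_id, and hp follow the list above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS collection (
    card_id  TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    set_id   TEXT NOT NULL,
    hp       INTEGER,
    quantity INTEGER NOT NULL DEFAULT 1
);
CREATE INDEX IF NOT EXISTS idx_collection_name   ON collection(name);
CREATE INDEX IF NOT EXISTS idx_collection_set_id ON collection(set_id);
CREATE INDEX IF NOT EXISTS idx_collection_hp     ON collection(hp);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the collection database and create the schema if needed."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_card(conn: sqlite3.Connection, card_id: str, name: str, set_id: str, hp: int) -> int:
    """Insert a card, or bump its quantity if already owned. Returns the quantity."""
    cursor = conn.execute(
        "UPDATE collection SET quantity = quantity + 1 WHERE card_id = ?", (card_id,)
    )
    if cursor.rowcount == 0:  # not in the collection yet
        conn.execute(
            "INSERT INTO collection (card_id, name, set_id, hp) VALUES (?, ?, ?, ?)",
            (card_id, name, set_id, hp),
        )
    conn.commit()
    row = conn.execute(
        "SELECT quantity FROM collection WHERE card_id = ?", (card_id,)
    ).fetchone()
    return row[0]
```

The update-then-insert pattern is used here instead of an upsert clause so the sketch also runs on older SQLite builds.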

#Python Core: The Extraction Script

The main entry point is extract_batch_v2.py. Here's how it works:

#Image Preprocessing

#Card Type Detection


#Python Core: The Card Matching Engine

The matching logic in api/local_lookup.py implements the multi-signal matching described above.


#Python Core: The Web Scraper

Building the database required multiple scrapers.
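As an illustration of one scraping step, the sketch below collects image URLs from a page's HTML using only the standard library. The class and function names are hypothetical, and the assumption that card images appear as plain `<img src="...">` tags is not taken from the project; the real scrapers may rely on other markup or libraries.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".webp")


class ImageLinkParser(HTMLParser):
    """Collect absolute URLs of raster images referenced by <img> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src and src.lower().endswith(IMAGE_SUFFIXES):
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html: str, base_url: str) -> List[str]:
    """Return every image URL found in `html`, in document order."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls
```

Separating the parsing from the download step makes it possible to re-run only the failed downloads on a later pass.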


#Results

After implementing all components:

  • Extraction time: ~3-5 seconds per card
  • Success rate: ~85% of cards match at 60%+ confidence
  • Collection size: Started with 1 card (Ledyba, naturally)
  • Data coverage: All 2540 German cards with images

#Data Collection: Data Flow

This document shows the data flow through the Pokemon TCG Pocket card extraction system.

#Database Sources

#Card Matching Priority

#Data Schema

Input: Screenshot

After OCR: Signals

Database Match: Card Data

Collection Storage

#File Transformations

#Collection Statistics Flow

#Scraping Workflow

#Lessons Learned

  1. Post-processing is essential: OCR is never perfect. Build robust correction logic for common failure modes.

  2. Scraping is iterative: First pass rarely gets everything. Plan for multiple passes to fill gaps.

  3. Confidence scoring is subjective: The 60% threshold works, but some false positives slip through. Consider adding a user feedback loop.

  4. German text is tricky: Special characters (ü, ö, ä) and compound words cause matching issues. Normalize before comparing.
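For the fourth lesson, normalization before comparison can be sketched as follows. The transliteration rules (ä to ae, ö to oe, ü to ue, ß to ss) and the function names are assumptions for illustration; `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the project uses.

```python
import unicodedata
from difflib import SequenceMatcher

_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize_name(name: str) -> str:
    """Lowercase, transliterate umlauts, strip accents and punctuation."""
    text = unicodedata.normalize("NFC", name).casefold()  # compose, then lowercase
    text = text.translate(_UMLAUTS)
    # Decompose remaining accented letters (e.g. é) and drop the accent marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(ch for ch in text if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()
```

Normalizing both the OCR output and the database names with the same function means a dropped umlaut costs only a small amount of similarity instead of breaking the match.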


#Future Work

  • Add image-based matching using card art
  • Implement mobile app for camera capture
  • Add duplicate detection from different sets
  • Build web interface for collection browsing

#Conclusion

Building this card extractor taught me a lot about OCR pipelines, web scraping at scale, and multi-signal matching algorithms. The key takeaway: start simple, iterate on failures.

The full source code is available in the project repository. Happy collecting!


Built with Python, EasyOCR, SQLite, and lots of German card data.

Created: 4/9/2026
Last Updated: 4/9/2026