Pokemon TCG Card Extractor with OCR

python ocr sqlite webscraping computer-vision

Introduction

As a Pokemon TCG Pocket player, I found myself manually searching through my collection every time I wanted to check if I had a specific card. The game lets you capture cards, but there's no easy way to export or search your collection. So I built one.

This article documents how I built a complete card extraction system using OCR, web scraping, and SQLite. I'll walk through the architecture, the challenges I faced, and how I solved them.

The Problem

Manually cataloging cards is tedious:

Screenshot a card in the app
Look up the card name in a database
Record it in a spreadsheet

I wanted to automate this: screenshot → OCR → database in seconds, not minutes.

System Architecture

The system has five main components:

Component Breakdown

| Component | Purpose |
|-----------|---------|
| preprocessing/ | Image cropping, contrast enhancement |
| extraction/ | Detect Pokemon/Trainer/Energy cards |
| ocr_engine/ | EasyOCR + Tesseract for text extraction |
| api/local_lookup.py | Multi-signal card matching |
| database.py | SQLite collection storage |

Data Collection: Scraping Pokewiki.de

Before I could match cards, I needed a database. I scraped pokewiki.de (German Pokemon wiki) for card data.

What I Scraped

2540 unique cards across 17 sets (A1-B2a, PROMO-A, PROMO-B)
124 unique abilities with effect descriptions
4509 image URLs (including reprints)
~200 attack effects with detailed text

Scraping Data Flow

Card Detection: Pokemon vs Trainer vs Energy

Not all cards are equal. Pokemon cards have HP, attacks, and abilities. Trainer cards have different fields entirely. I needed to detect the card type first.

The detection uses German keywords since the game displays in German:
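A minimal sketch of what keyword-based detection could look like. The function name `detect_card_type` and the keyword lists are assumptions for illustration, not the project's actual code; the only thing taken from the cards themselves is that HP is printed as "KP" in German.

```python
import re

# Hypothetical keyword lists; tune them against real OCR output.
TRAINER_KEYWORDS = ("trainer", "unterstützer", "item")
ENERGY_KEYWORDS = ("energie",)

# "KP" next to a number, in either order. Up to four digits are allowed
# because OCR noise can append a stray digit to the HP value.
KP_RE = re.compile(r"\bkp\s*\d{2,4}|\d{2,4}\s*kp\b")


def detect_card_type(ocr_text: str) -> str:
    """Classify OCR text as 'pokemon', 'trainer', 'energy', or 'unknown'."""
    text = ocr_text.casefold()
    # Check for Pokemon first: attack text on Pokemon cards can also
    # mention energy, so the energy keyword alone is not decisive.
    if KP_RE.search(text):
        return "pokemon"
    if any(word in text for word in TRAINER_KEYWORDS):
        return "trainer"
    if any(word in text for word in ENERGY_KEYWORDS):
        return "energy"
    return "unknown"
```

The order of the checks matters more than the exact keywords: the most specific signal (a KP value) is tested first, and anything that matches nothing falls through to "unknown" so it can be reviewed by hand.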

OCR Extraction: EasyOCR to the Rescue

With the card type known, I extracted text using EasyOCR with German and English models.

Extraction Pipeline

Full End-to-End Data Flow

Image Preprocessing Pipeline

OCR Signal Correction Pipeline

Sample Extraction

Input: Card screenshot of Igastarnish (Grass/Bug Pokemon)

Raw OCR Output:

Parsed Signals:

Challenges & Solutions

Challenge 1: OCR Misreads HP Values

Problem: EasyOCR frequently misread HP values: a card with 50 HP came back as "502", and one with 80 HP as "802". The extra digit was noise from the KP icon (KP is the German abbreviation for HP).

Solution: Post-processing regex that strips trailing digits:
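One way such a correction could look, as a sketch rather than the project's actual regex. It assumes that real HP values are always multiples of 10, so a three- or four-digit reading that does not end in 0 has picked up one digit of noise. The function name `clean_hp` is made up for this example.

```python
import re
from typing import Optional


def clean_hp(raw: str) -> Optional[int]:
    """Return the HP value from an OCR token, dropping a stray trailing digit."""
    match = re.search(r"\d{2,4}", raw)
    if match is None:
        return None
    digits = match.group()
    # Assumption: HP is a multiple of 10. "502" therefore cannot be a real
    # value, and removing the last digit recovers "50".
    if len(digits) >= 3 and not digits.endswith("0"):
        digits = digits[:-1]
    return int(digits)


print(clean_hp("502"))   # 50
print(clean_hp("802"))   # 80
print(clean_hp("120"))   # 120 (already valid, left alone)
```

This rule cannot catch a noise digit that happens to be 0 (for example "500" read from a 50 HP card), so a range check against the highest HP in the card database would be a sensible second guard.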

Challenge 2: Duplicate Cards in Database

Problem: Some cards appear in multiple sets (reprints). The scraper was creating duplicate entries with different set IDs but the same card name.

Solution: Added deduplication logic that merges entries based on:

Exact German name match
Same HP value
Same Pokédex number
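The three criteria above can be sketched as a merge keyed on a tuple. The field names (`name_de`, `hp`, `dex_no`, `set_id`) and the function name are assumptions for illustration; the scraper's real record layout may differ.

```python
from typing import Dict, List, Tuple


def merge_reprints(cards: List[dict]) -> List[dict]:
    """Merge entries that share German name, HP, and Pokédex number.

    Each merged entry keeps a list of every set ID it appeared in.
    """
    merged: Dict[Tuple[str, int, int], dict] = {}
    for card in cards:
        key = (card["name_de"], card["hp"], card["dex_no"])
        if key not in merged:
            # Copy the record so the input list is left untouched.
            entry = {k: v for k, v in card.items() if k != "set_id"}
            entry["set_ids"] = [card["set_id"]]
            merged[key] = entry
        elif card["set_id"] not in merged[key]["set_ids"]:
            merged[key]["set_ids"].append(card["set_id"])
    return list(merged.values())
```

Keeping the set IDs as a list, instead of discarding all but one, means the matcher can still use the set as a signal after deduplication.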

Challenge 3: Missing Card Images

Problem: Initial scrape only got 1483 images. 1969 cards had no image URLs.

Solution: Ran scrape_images.py a second time with more aggressive timeout handling and retry logic:
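A sketch of retry logic with exponential backoff, using only the standard library. This is not the code from scrape_images.py; `fetch_with_retry` and its parameters are hypothetical, and the actual download function is passed in so the retry policy can be tested without network access.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    retries: int = 3,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url), retrying on timeouts and I/O errors.

    Waits backoff, 2*backoff, 4*backoff, ... seconds between attempts and
    returns None once all attempts have failed.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:  # covers TimeoutError and urllib.error.URLError
            if attempt == retries - 1:
                return None  # give up; the caller can record the URL for a later pass
            sleep(backoff * 2 ** attempt)
    return None
```

In a real scraper, `fetch` could be a small wrapper such as `lambda u: urllib.request.urlopen(u, timeout=30).read()`. Returning None instead of raising keeps a batch run going, and the failed URLs can be collected for the next pass.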

Challenge 4: Special Illustration Cards

Problem: Special illustration (SAI) cards have different image URLs on pokewiki - they're hosted on a separate CDN with different URL patterns.

Solution: Detect SAI cards by rarity ("4 Star" or "Special Illustration") and use a different URL template:

Challenge 5: Weakness/Retreat Not Extracted

Problem: The regex for weakness and retreat wasn't matching the OCR output. The weakness symbol (Fire+20) appeared on a separate line.

Solution: Improved regex patterns and looked at the full OCR output context:
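A sketch of a weakness pattern that tolerates the value landing on its own line. It assumes the label is read as "Schwäche" (or "Schwache" when OCR drops the umlaut) and that the value looks like the `Fire+20` example above; the real OCR output and the project's actual patterns may differ. Retreat cost is left out because I am not assuming how it appears in the OCR text.

```python
import re
from typing import List, Optional, Tuple

# \s* matches newlines too, so the label and the value may sit on
# different OCR lines. [^\W\d_]+ matches a run of Unicode letters.
WEAKNESS_RE = re.compile(
    r"Schw[aä]che\s*:?\s*(?P<type>[^\W\d_]+)\s*\+\s*(?P<bonus>\d+)"
)


def parse_weakness(ocr_lines: List[str]) -> Optional[Tuple[str, int]]:
    """Search the whole OCR output, not one line at a time."""
    text = "\n".join(ocr_lines)
    match = WEAKNESS_RE.search(text)
    if match is None:
        return None
    return match.group("type"), int(match.group("bonus"))


print(parse_weakness(["Schwäche", "Fire+20"]))  # ('Fire', 20)
```

The key change is joining the lines before matching: a per-line regex can never see a label and a value that OCR split apart.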

Card Matching: The Multi-Signal Engine

With extracted signals and a database, I needed a matching algorithm. I implemented a priority-based approach:

Confidence Scoring

| Strategy | Confidence | When Used |
|----------|------------|-----------|
| Name + Set | 95% | Exact German name + set ID match |
| Name + HP | 85% | Name fuzzy match + HP match |
| HP + Attack + Set | 85% | HP + attack name + set combo |
| HP + Weakness + Set | 80% | HP + weakness + set combo |
| HP only | 60% | Last resort - just HP match |

Cards below 60% confidence go to failed_to_capture/ for manual review.
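The strategies in the table can be sketched as an ordered list that is tried from most to least specific. This is an illustration, not the code in api/local_lookup.py: the field names, the `difflib`-based fuzzy comparison, and its 0.8 cutoff are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy name comparison; the 0.8 cutoff is an arbitrary starting point."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio() >= threshold


def match_card(signals: dict, cards: List[dict]) -> Optional[Tuple[dict, int]]:
    """Return (card, confidence) from the first strategy that matches, else None."""
    name = signals.get("name")
    hp = signals.get("hp")
    set_id = signals.get("set_id")
    attack = signals.get("attack")
    weakness = signals.get("weakness")

    # (confidence, required signals, predicate), most specific first.
    strategies: List[Tuple[int, tuple, Callable[[dict], bool]]] = [
        (95, (name, set_id),
         lambda c: c["name"] == name and c["set_id"] == set_id),
        (85, (name, hp),
         lambda c: similar(c["name"], name) and c["hp"] == hp),
        (85, (hp, attack, set_id),
         lambda c: c["hp"] == hp and attack in c["attacks"] and c["set_id"] == set_id),
        (80, (hp, weakness, set_id),
         lambda c: c["hp"] == hp and c["weakness"] == weakness and c["set_id"] == set_id),
        (60, (hp,), lambda c: c["hp"] == hp),
    ]
    for confidence, required, predicate in strategies:
        if any(value is None for value in required):
            continue  # skip strategies whose signals OCR did not produce
        for card in cards:
            if predicate(card):
                return card, confidence
    return None
```

A return value of None corresponds to the "below 60%" case, where the screenshot would be moved to failed_to_capture/. Note that this sketch returns the first card that satisfies a strategy, so an HP-only match is only as good as the ordering of the card list.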

Database Design

The collection uses SQLite with a simple schema.

Key features:

Quantity tracking: Increment when adding duplicates
Full card data: All fields stored for filtering
Fast lookups: Indexed on name, set_id, hp
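A minimal sketch of a table with these three properties, using Python's built-in `sqlite3` module. The table name, column names, and the choice of (name, set_id) as the uniqueness key are assumptions; the real schema in database.py stores more fields.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS collection (
    id       INTEGER PRIMARY KEY,
    name     TEXT    NOT NULL,
    set_id   TEXT    NOT NULL,
    hp       INTEGER,
    quantity INTEGER NOT NULL DEFAULT 1,
    UNIQUE (name, set_id)
);
CREATE INDEX IF NOT EXISTS idx_collection_name   ON collection (name);
CREATE INDEX IF NOT EXISTS idx_collection_set_id ON collection (set_id);
CREATE INDEX IF NOT EXISTS idx_collection_hp     ON collection (hp);
"""


def open_collection(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the collection database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_card(conn: sqlite3.Connection, name: str, set_id: str, hp: int) -> int:
    """Insert a card, or increment its quantity if already owned.

    Returns the quantity after the update.
    """
    cur = conn.execute(
        "UPDATE collection SET quantity = quantity + 1 WHERE name = ? AND set_id = ?",
        (name, set_id),
    )
    if cur.rowcount == 0:  # no existing row, so this is a new card
        conn.execute(
            "INSERT INTO collection (name, set_id, hp) VALUES (?, ?, ?)",
            (name, set_id, hp),
        )
    conn.commit()
    row = conn.execute(
        "SELECT quantity FROM collection WHERE name = ? AND set_id = ?",
        (name, set_id),
    ).fetchone()
    return row[0]
```

The update-then-insert pattern avoids relying on SQLite's upsert syntax, which needs a fairly recent SQLite build; with a current version, `INSERT ... ON CONFLICT DO UPDATE` would do the same in one statement.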

Python Core: The Extraction Script

The main entry point is extract_batch_v2.py. Here's how it works:

Image Preprocessing

Card Type Detection

Python Core: The Card Matching Engine

The matching logic in api/local_lookup.py implements multi-signal matching:

Python Core: The Web Scraper

Building the database required multiple scrapers:

Results

After implementing all components:

Extraction time: ~3-5 seconds per card
Success rate: ~85% of cards match at 60%+ confidence
Collection size: Started with 1 card (Ledyba, naturally)
Data coverage: All 2540 German cards with images

Data Collection: Data Flow

This section shows the data flow through the Pokemon TCG Pocket card extraction system.

Database Sources

Card Matching Priority

Data Schema

Input: Screenshot

After OCR: Signals

Database Match: Card Data

Collection Storage

File Transformations

Collection Statistics Flow

Scraping Workflow

Lessons Learned

Post-processing is essential: OCR is never perfect. Build robust correction logic for common failure modes.
Scraping is iterative: The first pass rarely gets everything. Plan for multiple passes to fill gaps.
Confidence scoring is subjective: A 60% threshold works, but some false positives slip through. Consider adding a user feedback loop.
German text is tricky: Special characters (ü, ö, ä) and compound words cause matching issues. Normalize before comparing.
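For the last point, a sketch of name normalization with the standard library. The transliteration table (ä to ae, and so on) and the decision to drop hyphens and spaces are my assumptions about what "normalize" should mean here, not the project's actual rules.

```python
import unicodedata

# Common German transliterations, applied after lowercasing.
_UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize_name(name: str) -> str:
    """Return a comparison key for a German card name."""
    # NFC merges "a" + combining diaeresis into a single "ä", so both
    # encodings of the same visible character end up identical.
    text = unicodedata.normalize("NFC", name).lower()
    text = text.translate(_UMLAUTS)
    # Drop anything that is not a letter or digit (hyphens, spaces, OCR specks).
    return "".join(ch for ch in text if ch.isalnum())


print(normalize_name("Schwäche"))  # schwaeche
```

Both the OCR result and the database name should go through the same function before they are compared, so that differences in case, Unicode form, or punctuation do not count against the match.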

Future Work

Add image-based matching using card art
Implement mobile app for camera capture
Add duplicate detection from different sets
Build web interface for collection browsing

Conclusion

Building this card extractor taught me a lot about OCR pipelines, web scraping at scale, and multi-signal matching algorithms. The key takeaway: start simple, iterate on failures.

The full source code is available in the project repository. Happy collecting!

Built with Python, EasyOCR, SQLite, and lots of German card data.
