ReviewProof

This project is a Python-based web scraping system designed to collect comprehensive data about local businesses in Bielefeld, Germany - including restaurants, medical practitioners, cafes, shops, and more. The scraper uses Playwright (a modern browser automation library) to interact with Google Search and extract business information including ratings, review counts, addresses, phone numbers, and GDPR-related deletion notices.

Scraper Interface

Key Objectives:

  • Collect data on 500+ businesses across multiple categories
  • Capture ratings, review counts, addresses, and deleted review notices
  • Maintain data quality through strict validation rules
  • Build a reusable, maintainable scraper architecture
  • Support multiple business types: restaurants, doctors, shops, cafes

Architecture

System Overview


Component Architecture


Data Pipeline

End-to-End Flow


Phase 1: Search & Collection

The scraper uses a multi-term search strategy to maximize coverage across different business types:

Each term is searched across 3 pages × 10 results = 30 potential entries per term. With 6 search terms, that yields up to 180 potential entries.

Search Results

Phase 2: Name Extraction


Phase 3: Individual Scraping


Scraping Methodology

Browser Setup

Key configurations:

  • headless=False: runs a visible browser window for visual debugging
  • --disable-blink-features=AutomationControlled: reduces the automation fingerprint used for bot detection
  • --no-sandbox: required in some Linux environments
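
The launch code itself is not reproduced in this document; a minimal Playwright sketch matching the bullet points above (the helper name `launch_browser` is hypothetical):

```python
from typing import Any

# Chromium flags mirroring the bullets above: reduce the automation
# fingerprint and disable the sandbox for restricted Linux environments.
LAUNCH_ARGS = [
    "--disable-blink-features=AutomationControlled",
    "--no-sandbox",
]

def launch_browser(playwright: Any):
    """Launch a visible Chromium instance (hypothetical helper)."""
    return playwright.chromium.launch(
        headless=False,  # visible window for visual debugging
        args=LAUNCH_ARGS,
    )
```

In the real scraper this would run inside a `with sync_playwright() as p:` block.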

Search URL Pattern

| Parameter | Value | Purpose |
|-----------|-------|---------|
| q | Search term | Google query |
| tbm | lcl | Local business results |
| hl | de-DE | German-language results |
| start | 0, 10, 20 | Pagination |
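
The parameters above can be combined into one URL per result page; a sketch (the `search_urls` helper is an illustration, not the project's actual function):

```python
from urllib.parse import urlencode

def search_urls(term: str, pages: int = 3) -> list:
    """One Google search URL per result page (10 results per page)."""
    urls = []
    for page in range(pages):
        params = {
            "q": term,           # search term
            "tbm": "lcl",        # local business results
            "hl": "de-DE",       # German-language results
            "start": page * 10,  # pagination offset
        }
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls
```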


Data Extraction Patterns

Rating Pattern

Address Pattern

Deleted Reviews (GDPR)
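
The pattern snippets for the three headings above were not preserved. A hedged reconstruction, assuming the snippet formats quoted elsewhere in this document (comma-decimal ratings such as "4,5(103)", the "Adresse:" prefix, and the "Einige Ergebnisse..." GDPR notice):

```python
import re

# "4,5" immediately followed by a bracketed review count, e.g. "4,5(103)"
RATING_RE = re.compile(r"(\d,\d)\s*\((\d+)\)")

# "Adresse: Hauptstr. 1, 33602 Bielefeld" - five-digit postcode + city
ADDRESS_RE = re.compile(r"Adresse:\s*(.+?\d{5}\s+Bielefeld)")

# GDPR deletion notice; only its prefix is quoted in this document
DELETED_RE = re.compile(r"Einige Ergebnisse[^.]*")

def extract_fields(text: str) -> dict:
    """Pull rating, review count, address, and deletion notice from snippet text."""
    rating = RATING_RE.search(text)
    address = ADDRESS_RE.search(text)
    deleted = DELETED_RE.search(text)
    return {
        "rating": rating.group(1) if rating else None,
        "total_reviews": rating.group(2) if rating else None,
        "address": address.group(1) if address else None,
        "deleted_reviews": deleted.group(0) if deleted else None,
    }
```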

Data Validation & Cleaning

Validation Rules


Filtering Math

The name filtering algorithm uses precise length and keyword constraints: names must be between 8 and 70 characters long, and keyword set operations ensure that only valid business names are kept. Fuzzy matching with Levenshtein distance detects and removes duplicate entries.
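
A minimal sketch of these rules, assuming the length bounds from the text; the keyword list and the distance threshold of 3 are illustrative:

```python
# Illustrative subset of the business keywords described below
BUSINESS_KEYWORDS = {"dr.", "praxis", "klinik", "zahnarzt",
                     "restaurant", "café", "bäckerei"}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_valid_name(name: str, seen: list) -> bool:
    """Length check, keyword check, then fuzzy dedup against accepted names."""
    if not 8 <= len(name) <= 70:
        return False
    tokens = set(name.lower().split())
    if not tokens & BUSINESS_KEYWORDS:   # must contain a business keyword
        return False
    return all(levenshtein(name.lower(), s.lower()) > 3 for s in seen)
```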

Business Keywords (Must Match)

The filtering system supports multiple business categories:

Rejection Patterns

Clean Categories

The system auto-categorizes businesses into:

Medical:

  • Doctors (Dr., Dr. med., Prof.)
  • Practices (Praxis, Gemeinschaftspraxis)
  • Clinics (Klinik, Krankenhaus)
  • Dentists (Zahnarzt, Zahnärzte)
  • Specialists (Facharzt, Centrum)
  • Therapy centers

Restaurants & Food:

  • Restaurants (Restaurant, Gastronomie)
  • Cafes (Café, Konditorei)
  • Fast food (Pizza, Döner, Grill)
  • Bakeries (Bäckerei)

Retail & Services:

  • Shops and stores
  • Electronics (Elektro, Elektronik)
  • Fashion (Mode, Boutique)
  • Markets (Markt)

Validation Process

Statistical Formulas

Data quality is measured using statistical metrics:

  • Mean rating: mean = SUM(x) / n

  • Standard deviation: std = sqrt(SUM((x-mean)^2)/n)

  • Data completeness rates:

    • Rating completeness: R = count(rating) / total * 100%
    • Address completeness: A = count(address) / total * 100%
    • Deleted reviews: D = count(deleted) / total * 100%
  • Rating histogram bins:

    | Bin | Range | Count |
    |-----|-------|-------|
    | 1 | 4.5-5.0 | 89 |
    | 2 | 4.0-4.4 | 45 |
    | 3 | 3.5-3.9 | 18 |
    | 4 | under 3.5 | 5 |

  • Category distribution percentages:

    • Doctor: 62/157 = 39.5%
    • Dentist: 35/157 = 22.3%
    • Clinic: 28/157 = 17.8%
    • Medical Practice: 20/157 = 12.7%
    • Specialist: 12/157 = 7.6%
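
These metrics can be computed directly with the standard library; the sample values below are illustrative, not the project's dataset:

```python
from statistics import mean, pstdev

def quality_metrics(ratings, addresses) -> dict:
    """Mean, population std deviation, and address completeness (percent)."""
    return {
        "mean_rating": mean(ratings),        # mean = SUM(x) / n
        "std_rating": pstdev(ratings),       # std = sqrt(SUM((x-mean)^2)/n)
        "address_completeness": sum(a is not None for a in addresses)
                                / len(addresses) * 100,
    }
```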

Timing Calculations

Performance timing is calculated using:

  • Page load time estimates:

    • Average load: t_load = 1.8s +/- 0.5s
    • Timeout threshold: t_max = 10s
  • Random delay distribution: U(1.0, 2.0) seconds between requests

  • Total runtime formula:

    For 157 entries: T = 157 * (1.8 + 0.3 + 0.1) + 156 * 1.5 = 345.4 + 234 = 579.4s, or about 9.7 minutes
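
Evaluating the formula numerically (the 1.5 s term is the mean of the U(1.0, 2.0) delay):

```python
n = 157
per_entry = 1.8 + 0.3 + 0.1   # page load + parse + save, in seconds
mean_delay = 1.5              # mean of U(1.0, 2.0)
total = n * per_entry + (n - 1) * mean_delay  # delays between entries only
minutes = total / 60
```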

Performance Analysis

| Operation | Time Complexity | Description |
|-----------|-----------------|-------------|
| Name filtering | O(n) | Linear scan of token set |
| Fuzzy matching | O(n * m) | Levenshtein on all pairs |
| Rating extraction | O(1) | Single regex match |
| DB insert/update | O(1) | Hash table lookup |
| CSV export | O(n) | Full table scan |

  • Memory estimation:

    • Raw HTML buffer: ~5MB per page
    • Name set: ~50KB for 1000 names
    • SQLite DB: ~500KB for 157 entries
    • Total peak: ~10MB
  • Throughput calculation:

Advanced Calculations

  • Haversine distance (for future map visualization):

  • Google Maps coordinate extraction:

  • Quality scoring algorithm (0-100):
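
The calculation snippets above were not preserved. The Haversine formula is standard and can be reconstructed safely; a sketch for the planned map visualization:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius
```

For example, Bielefeld to Berlin (approximate coordinates) comes out at roughly 340 km.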

Database Schema

SQLite Schema

Entity Relationship


Field Descriptions

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| id | INTEGER | Primary key | 1 |
| name | TEXT | Business name | "Dr. med. Hans Müller" |
| rating | TEXT | Rating (1.0-5.0) | "4.5" |
| total_reviews | TEXT | Number of reviews | "103" |
| deleted_reviews | TEXT | GDPR deletion notice | "Einige Ergebnisse..." |
| address | TEXT | Full address | "Adresse: Hauptstr. 1, 33602 Bielefeld" |
| url | TEXT | Google Maps URL | "https://www.google.com/maps/..." |
| category | TEXT | Auto-tagged category | "Doctor", "Dentist", "Clinic" |
| scrape_date | TEXT | First scrape date | "2026-05-11" |
| last_updated | TEXT | Last update date | "2026-05-11" |
| status | TEXT | Active/inactive | "active" |
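
The DDL itself is not reproduced here; reconstructed from the field table (the table name `businesses` and the UNIQUE constraint are assumptions):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS businesses (
    id              INTEGER PRIMARY KEY,
    name            TEXT UNIQUE,
    rating          TEXT,
    total_reviews   TEXT,
    deleted_reviews TEXT,
    address         TEXT,
    url             TEXT,
    category        TEXT,
    scrape_date     TEXT,
    last_updated    TEXT,
    status          TEXT DEFAULT 'active'
)
"""

conn = sqlite3.connect(":memory:")  # a file path in the real scraper
conn.execute(SCHEMA)
```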

Category Auto-Tagging
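
The tagging code is not shown in this document; a hedged sketch using the keyword groups listed under "Clean Categories" above. Order matters so that, e.g., "Zahnarzt Dr. Schmidt" is tagged Dentist rather than Doctor:

```python
# Most-specific categories first; keyword subsets are illustrative.
CATEGORY_KEYWORDS = [
    ("Dentist", ["zahnarzt", "zahnärzte"]),
    ("Clinic", ["klinik", "krankenhaus"]),
    ("Medical Practice", ["praxis"]),
    ("Specialist", ["facharzt", "centrum"]),
    ("Doctor", ["dr.", "prof."]),
]

def categorize(name: str) -> str:
    """Return the first category whose keywords appear in the name."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(k in lowered for k in keywords):
            return category
    return "Other"
```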

Data Flow Diagrams

Complete Pipeline


Error Handling Flow


Challenges & Solutions

Challenge 1: Dynamic Google UI

Problem: Google frequently changes their HTML structure and CSS class names.

Solution: Use text-based extraction instead of CSS selectors:

Before (Selector-based):

After (Text-based):
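
Neither snippet survived in this document; a hedged reconstruction of the contrast (the CSS class in the "before" version is invented for illustration):

```python
import re

def extract_rating_selector(page):
    # Before: brittle - ".BTtC6e" stands in for an obfuscated class name
    # that Google can rename at any time.
    return page.query_selector(".BTtC6e").inner_text()

def extract_rating_text(snippet_text: str):
    # After: robust - match the visible "4,5(103)" pattern directly,
    # independent of the surrounding markup.
    m = re.search(r"(\d,\d)\s*\((\d+)\)", snippet_text)
    return m.group(1) if m else None
```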

Challenge 2: Cookie Consent Overlay

Problem: Google shows a cookie consent overlay that blocks content.

Solution: Automated button clicking with multiple attempts:
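
A sketch of this handling, assuming the usual German Google consent button labels (treat the labels and the helper name as assumptions):

```python
CONSENT_LABELS = ["Alle akzeptieren", "Ich stimme zu", "Alle ablehnen"]

def dismiss_consent(page, attempts: int = 3) -> bool:
    """Try each known consent button up to `attempts` times."""
    for _ in range(attempts):
        for label in CONSENT_LABELS:
            try:
                page.click(f"button:has-text('{label}')", timeout=2000)
                return True
            except Exception:
                continue  # button not present; try the next label
    return False
```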

Challenge 3: Non-Medical Entries

Problem: Search results include gyms, pharmacies, opticians.

Solution: Strict filtering with keyword validation:
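
The filter code is not reproduced here; an illustrative rejection check for the entry types named above (the keyword set is a made-up subset):

```python
# German keywords for gyms, pharmacies, and opticians (illustrative)
REJECT_KEYWORDS = {"fitness", "apotheke", "optiker"}

def is_rejected(name: str) -> bool:
    """True if the name contains any non-medical rejection keyword."""
    tokens = set(name.lower().split())
    return bool(tokens & REJECT_KEYWORDS)
```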

Challenge 4: Duplicate Entries

Problem: Same clinic appears under different names (e.g., "EvKB Haus Gilead I" vs "EvKB Haus Gilead IV").

Solution: Database-level dedup with name matching:
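
A hedged sketch of the name-matching step: normalize names before comparing, and skip inserts whose normalized form is already present. The normalization rules (lowercase, collapse whitespace, strip a trailing house or Roman numeral) are assumptions fitting the "Haus Gilead I" vs "Haus Gilead IV" example:

```python
import re

def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, drop a trailing numeral suffix."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"\s+(?:[ivx]+|\d+)$", "", name)

def should_insert(name: str, existing: set) -> bool:
    """Record and accept a name only if its normalized form is new."""
    key = normalize(name)
    if key in existing:
        return False
    existing.add(key)
    return True
```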

Challenge 5: Missing Rating Data

Problem: Some entries don't display ratings in search snippets.

Solution: Strict validation - only save entries with complete data:
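
A minimal sketch of the completeness rule, using the field names from the schema table in this document:

```python
# An entry is saved only if both fields were actually extracted.
REQUIRED_FIELDS = ("rating", "total_reviews")

def is_complete(entry: dict) -> bool:
    return all(entry.get(f) for f in REQUIRED_FIELDS)

def keep_complete(entries: list) -> list:
    """Drop entries missing rating or review count."""
    return [e for e in entries if is_complete(e)]
```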

Scraping Results

Results & Statistics

Current Dataset

| Metric | Value |
|--------|-------|
| Total Entries | 157 |
| With Ratings | 157 (100%) |
| With Reviews | 157 (100%) |
| With Deleted Reviews | 58 (37%) |
| With Addresses | 16 (10%) |

Rating Distribution


Category Breakdown

| Category | Count | Percentage |
|----------|-------|------------|
| Doctor | 62 | 39% |
| Dentist | 35 | 22% |
| Clinic | 28 | 18% |
| Medical Practice | 20 | 13% |
| Specialist | 12 | 8% |

Deleted Reviews Analysis


Results Dashboard

Technical Stack

Dependencies

File Structure

Database Structure

Future Enhancements

Planned Improvements

  1. Parallel Processing

    • Use multiple browser contexts simultaneously
    • Reduce total scraping time by 50%
  2. Enhanced Validation

    • Fuzzy name matching for duplicate detection
    • URL-based deduplication as backup
  3. Address Extraction

    • Improve regex patterns for German addresses
    • Parse multiple address formats
  4. Review Content

    • Collect actual review text (with consent)
    • Sentiment analysis on reviews
  5. Monitoring

    • Track changes over time (re-scrape detection)
    • Alert on rating changes

Potential Extensions

Appendix: Code Reference

Main Loop Structure
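
The original loop is not reproduced in this document. A hedged sketch of the two-phase flow described earlier (collect names per term, then scrape each business); the collaborator functions are injected so the structure can be shown without a live browser:

```python
def run(search_terms, fetch_results, scrape_business, save):
    """Phase 1: gather names; Phase 2: scrape each unique name and save it.

    fetch_results(term, page) -> list of names on one result page
    scrape_business(name)     -> entry dict, or None if incomplete
    save(entry)               -> persist one validated entry
    """
    names = []
    for term in search_terms:
        for page in range(3):  # 3 result pages per term
            names.extend(fetch_results(term, page))
    for name in dict.fromkeys(names):  # dedupe while keeping order
        entry = scrape_business(name)
        if entry is not None:
            save(entry)
```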

CSV Export Function
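
The export code is likewise not reproduced; a hedged sketch that dumps all rows of the SQLite table to a CSV file with a header row (the table name `businesses` is an assumption):

```python
import csv
import sqlite3

def export_csv(conn: sqlite3.Connection, path: str) -> int:
    """Write every row of the businesses table to `path`; return row count."""
    cur = conn.execute("SELECT * FROM businesses ORDER BY name")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([c[0] for c in cur.description])  # column headers
        rows = cur.fetchall()
        writer.writerows(rows)
    return len(rows)
```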


Conclusion

This project demonstrates a practical approach to automated medical data collection from web search results. Key learnings include:

  1. Text-based extraction is more robust than CSS selectors for dynamic web pages
  2. Strict validation ensures high data quality even if it means fewer entries
  3. Modular design allows easy maintenance and extension
  4. Automated cleaning catches bad entries that slip through initial filtering
  5. Mathematical validation provides quantifiable data quality metrics

The scraper successfully collects validated medical practitioner data from Bielefeld, with 100% of entries containing ratings and review counts. The architecture is designed for extensibility to other cities and data sources.


Document Version: 1.1
Last Updated: May 11, 2026
License: MIT
