ReviewProof

This project is a Python-based web scraping system designed to collect comprehensive data about local businesses in Bielefeld, Germany - including restaurants, medical practitioners, cafes, shops, and more. The scraper uses Playwright (a modern browser automation library) to interact with Google Search and extract business information including ratings, review counts, addresses, phone numbers, and GDPR-related deletion notices.

Scraper Interface

Key Objectives:

  • Collect data on 500+ businesses across multiple categories
  • Capture ratings, review counts, addresses, and deleted review notices
  • Maintain data quality through strict validation rules
  • Build a reusable, maintainable scraper architecture
  • Support multiple business types: restaurants, doctors, shops, cafes

Architecture

System Overview


Component Architecture


Data Pipeline

End-to-End Flow


Phase 1: Search & Collection

The scraper uses a multi-term search strategy to maximize coverage across different business types:

Each term is searched across 3 pages × 10 results = 30 potential entries per term. With 6 search terms, that yields up to 180 potential entries.

Search Results

Phase 2: Name Extraction


Phase 3: Individual Scraping


Scraping Methodology

Browser Setup

Key configurations:

  • headless=False: runs a visible browser window for visual debugging
  • --disable-blink-features=AutomationControlled: reduces the automation fingerprint used for bot detection
  • --no-sandbox: required in some Linux environments
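
The launch code itself is not reproduced in this document; a minimal Playwright sketch matching the bullet points above (the helper name `launch_browser` is hypothetical):

```python
from typing import Any

# Chromium flags mirroring the bullets above: reduce the automation
# fingerprint and disable the sandbox for restricted Linux environments.
LAUNCH_ARGS = [
    "--disable-blink-features=AutomationControlled",
    "--no-sandbox",
]

def launch_browser(playwright: Any):
    """Launch a visible Chromium instance (hypothetical helper)."""
    return playwright.chromium.launch(
        headless=False,  # visible window for visual debugging
        args=LAUNCH_ARGS,
    )
```

In the real scraper this would run inside a `with sync_playwright() as p:` block.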

Search URL Pattern

| Parameter | Value | Purpose |
|-----------|-------|---------|
| q | Search term | Google query |
| tbm | lcl | Local business results |
| hl | de-DE | German-language results |
| start | 0, 10, 20 | Pagination |
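
The parameters above can be combined into one URL per result page; a sketch (the `search_urls` helper is an illustration, not the project's actual function):

```python
from urllib.parse import urlencode

def search_urls(term: str, pages: int = 3) -> list:
    """One Google search URL per result page (10 results per page)."""
    urls = []
    for page in range(pages):
        params = {
            "q": term,           # search term
            "tbm": "lcl",        # local business results
            "hl": "de-DE",       # German-language results
            "start": page * 10,  # pagination offset
        }
        urls.append("https://www.google.com/search?" + urlencode(params))
    return urls
```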


Data Extraction Patterns

Rating Pattern

Address Pattern

Deleted Reviews (GDPR)
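
The pattern snippets for the three headings above were not preserved. A hedged reconstruction, assuming the snippet formats quoted elsewhere in this document (comma-decimal ratings such as "4,5(103)", the "Adresse:" prefix, and the "Einige Ergebnisse..." GDPR notice):

```python
import re

# "4,5" immediately followed by a bracketed review count, e.g. "4,5(103)"
RATING_RE = re.compile(r"(\d,\d)\s*\((\d+)\)")

# "Adresse: Hauptstr. 1, 33602 Bielefeld" - five-digit postcode + city
ADDRESS_RE = re.compile(r"Adresse:\s*(.+?\d{5}\s+Bielefeld)")

# GDPR deletion notice; only its prefix is quoted in this document
DELETED_RE = re.compile(r"Einige Ergebnisse[^.]*")

def extract_fields(text: str) -> dict:
    """Pull rating, review count, address, and deletion notice from snippet text."""
    rating = RATING_RE.search(text)
    address = ADDRESS_RE.search(text)
    deleted = DELETED_RE.search(text)
    return {
        "rating": rating.group(1) if rating else None,
        "total_reviews": rating.group(2) if rating else None,
        "address": address.group(1) if address else None,
        "deleted_reviews": deleted.group(0) if deleted else None,
    }
```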

Data Validation & Cleaning

Validation Rules


Filtering Math

The name filtering algorithm uses precise length and keyword constraints: names must be between 8 and 70 characters long, and keyword set operations ensure that only valid business names are kept. Fuzzy matching with Levenshtein distance detects and removes duplicate entries.
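
A minimal sketch of these rules, assuming the length bounds from the text; the keyword list and the distance threshold of 3 are illustrative:

```python
# Illustrative subset of the business keywords described below
BUSINESS_KEYWORDS = {"dr.", "praxis", "klinik", "zahnarzt",
                     "restaurant", "café", "bäckerei"}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_valid_name(name: str, seen: list) -> bool:
    """Length check, keyword check, then fuzzy dedup against accepted names."""
    if not 8 <= len(name) <= 70:
        return False
    tokens = set(name.lower().split())
    if not tokens & BUSINESS_KEYWORDS:   # must contain a business keyword
        return False
    return all(levenshtein(name.lower(), s.lower()) > 3 for s in seen)
```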

Business Keywords (Must Match)

The filtering system supports multiple business categories:

Rejection Patterns

Clean Categories

The system auto-categorizes businesses into:

Medical:

  • Doctors (Dr., Dr. med., Prof.)
  • Practices (Praxis, Gemeinschaftspraxis)
  • Clinics (Klinik, Krankenhaus)
  • Dentists (Zahnarzt, Zahnärzte)
  • Specialists (Facharzt, Centrum)
  • Therapy centers

Restaurants & Food:

  • Restaurants (Restaurant, Gastronomie)
  • Cafes (Café, Konditorei)
  • Fast food (Pizza, Döner, Grill)
  • Bakeries (Bäckerei)

Retail & Services:

  • Shops and stores
  • Electronics (Elektro, Elektronik)
  • Fashion (Mode, Boutique)
  • Markets (Markt)

Validation Process

Statistical Formulas

Data quality is measured using statistical metrics:

  • Mean rating: mean = SUM(x) / n

  • Standard deviation: std = sqrt(SUM((x-mean)^2)/n)

  • Data completeness rates:

    • Rating completeness: R = count(rating) / total * 100%
    • Address completeness: A = count(address) / total * 100%
    • Deleted reviews: D = count(deleted) / total * 100%
  • Rating histogram bins:

    | Bin | Range | Count |
    |-----|-------|-------|
    | 1 | 4.5-5.0 | 89 |
    | 2 | 4.0-4.4 | 45 |
    | 3 | 3.5-3.9 | 18 |
    | 4 | under 3.5 | 5 |

  • Category distribution percentages:

    • Doctor: 62/157 = 39.5%
    • Dentist: 35/157 = 22.3%
    • Clinic: 28/157 = 17.8%
    • Medical Practice: 20/157 = 12.7%
    • Specialist: 12/157 = 7.6%
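
These metrics can be computed directly with the standard library; the sample values below are illustrative, not the project's dataset:

```python
from statistics import mean, pstdev

def quality_metrics(ratings, addresses) -> dict:
    """Mean, population std deviation, and address completeness (percent)."""
    return {
        "mean_rating": mean(ratings),        # mean = SUM(x) / n
        "std_rating": pstdev(ratings),       # std = sqrt(SUM((x-mean)^2)/n)
        "address_completeness": sum(a is not None for a in addresses)
                                / len(addresses) * 100,
    }
```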

Timing Calculations

Performance timing is calculated using:

  • Page load time estimates:

    • Average load: t_load = 1.8s +/- 0.5s
    • Timeout threshold: t_max = 10s
  • Random delay distribution: U(1.0, 2.0) seconds between requests

  • Total runtime formula:

    For 157 entries: T = 157 * (1.8 + 0.3 + 0.1) + 156 * 1.5 = 345.4 + 234 = 579.4s, or about 9.7 minutes
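
Evaluating the formula numerically (the 1.5 s term is the mean of the U(1.0, 2.0) delay):

```python
n = 157
per_entry = 1.8 + 0.3 + 0.1   # page load + parse + save, in seconds
mean_delay = 1.5              # mean of U(1.0, 2.0)
total = n * per_entry + (n - 1) * mean_delay  # delays between entries only
minutes = total / 60
```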

Performance Analysis

| Operation | Time Complexity | Description |
|-----------|-----------------|-------------|
| Name filtering | O(n) | Linear scan of token set |
| Fuzzy matching | O(n * m) | Levenshtein on all pairs |
| Rating extraction | O(1) | Single regex match |
| DB insert/update | O(1) | Hash table lookup |
| CSV export | O(n) | Full table scan |

  • Memory estimation:

    • Raw HTML buffer: ~5MB per page
    • Name set: ~50KB for 1000 names
    • SQLite DB: ~500KB for 157 entries
    • Total peak: ~10MB
  • Throughput calculation:

Advanced Calculations

  • Haversine distance (for future map visualization):

  • Google Maps coordinate extraction:

  • Quality scoring algorithm (0-100):
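
The calculation snippets above were not preserved. The Haversine formula is standard and can be reconstructed safely; a sketch for the planned map visualization:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two WGS84 coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius
```

For example, Bielefeld to Berlin (approximate coordinates) comes out at roughly 340 km.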

Database Schema

SQLite Schema

Entity Relationship


Field Descriptions

| Field | Type | Description | Example |
|-------|------|-------------|---------|
| id | INTEGER | Primary key | 1 |
| name | TEXT | Business name | "Dr. med. Hans Müller" |
| rating | TEXT | Rating (1.0-5.0) | "4.5" |
| total_reviews | TEXT | Number of reviews | "103" |
| deleted_reviews | TEXT | GDPR deletion notice | "Einige Ergebnisse..." |
| address | TEXT | Full address | "Adresse: Hauptstr. 1, 33602 Bielefeld" |
| url | TEXT | Google Maps URL | "https://www.google.com/maps/..." |
| category | TEXT | Auto-tagged category | "Doctor", "Dentist", "Clinic" |
| scrape_date | TEXT | First scrape date | "2026-05-11" |
| last_updated | TEXT | Last update date | "2026-05-11" |
| status | TEXT | Active/inactive | "active" |
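
The DDL itself is not reproduced here; reconstructed from the field table (the table name `businesses` and the UNIQUE constraint are assumptions):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS businesses (
    id              INTEGER PRIMARY KEY,
    name            TEXT UNIQUE,
    rating          TEXT,
    total_reviews   TEXT,
    deleted_reviews TEXT,
    address         TEXT,
    url             TEXT,
    category        TEXT,
    scrape_date     TEXT,
    last_updated    TEXT,
    status          TEXT DEFAULT 'active'
)
"""

conn = sqlite3.connect(":memory:")  # a file path in the real scraper
conn.execute(SCHEMA)
```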

Category Auto-Tagging
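
The tagging code is not shown in this document; a hedged sketch using the keyword groups listed under "Clean Categories" above. Order matters so that, e.g., "Zahnarzt Dr. Schmidt" is tagged Dentist rather than Doctor:

```python
# Most-specific categories first; keyword subsets are illustrative.
CATEGORY_KEYWORDS = [
    ("Dentist", ["zahnarzt", "zahnärzte"]),
    ("Clinic", ["klinik", "krankenhaus"]),
    ("Medical Practice", ["praxis"]),
    ("Specialist", ["facharzt", "centrum"]),
    ("Doctor", ["dr.", "prof."]),
]

def categorize(name: str) -> str:
    """Return the first category whose keywords appear in the name."""
    lowered = name.lower()
    for category, keywords in CATEGORY_KEYWORDS:
        if any(k in lowered for k in keywords):
            return category
    return "Other"
```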

Data Flow Diagrams

Complete Pipeline


Error Handling Flow


Challenges & Solutions

Challenge 1: Dynamic Google UI

Problem: Google frequently changes their HTML structure and CSS class names.

Solution: Use text-based extraction instead of CSS selectors:

Before (Selector-based):

After (Text-based):
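
Neither snippet survived in this document; a hedged reconstruction of the contrast (the CSS class in the "before" version is invented for illustration):

```python
import re

def extract_rating_selector(page):
    # Before: brittle - ".BTtC6e" stands in for an obfuscated class name
    # that Google can rename at any time.
    return page.query_selector(".BTtC6e").inner_text()

def extract_rating_text(snippet_text: str):
    # After: robust - match the visible "4,5(103)" pattern directly,
    # independent of the surrounding markup.
    m = re.search(r"(\d,\d)\s*\((\d+)\)", snippet_text)
    return m.group(1) if m else None
```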

Challenge 2: Cookie Consent Overlay

Problem: Google shows a cookie consent overlay that blocks content.

Solution: Automated button clicking with multiple attempts:
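
A sketch of this handling, assuming the usual German Google consent button labels (treat the labels and the helper name as assumptions):

```python
CONSENT_LABELS = ["Alle akzeptieren", "Ich stimme zu", "Alle ablehnen"]

def dismiss_consent(page, attempts: int = 3) -> bool:
    """Try each known consent button up to `attempts` times."""
    for _ in range(attempts):
        for label in CONSENT_LABELS:
            try:
                page.click(f"button:has-text('{label}')", timeout=2000)
                return True
            except Exception:
                continue  # button not present; try the next label
    return False
```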

Challenge 3: Non-Medical Entries

Problem: Search results include gyms, pharmacies, opticians.

Solution: Strict filtering with keyword validation:
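
The filter code is not reproduced here; an illustrative rejection check for the entry types named above (the keyword set is a made-up subset):

```python
# German keywords for gyms, pharmacies, and opticians (illustrative)
REJECT_KEYWORDS = {"fitness", "apotheke", "optiker"}

def is_rejected(name: str) -> bool:
    """True if the name contains any non-medical rejection keyword."""
    tokens = set(name.lower().split())
    return bool(tokens & REJECT_KEYWORDS)
```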

Challenge 4: Duplicate Entries

Problem: Same clinic appears under different names (e.g., "EvKB Haus Gilead I" vs "EvKB Haus Gilead IV").

Solution: Database-level dedup with name matching:
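
A hedged sketch of the name-matching step: normalize names before comparing, and skip inserts whose normalized form is already present. The normalization rules (lowercase, collapse whitespace, strip a trailing house or Roman numeral) are assumptions fitting the "Haus Gilead I" vs "Haus Gilead IV" example:

```python
import re

def normalize(name: str) -> str:
    """Lowercase, collapse whitespace, drop a trailing numeral suffix."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"\s+(?:[ivx]+|\d+)$", "", name)

def should_insert(name: str, existing: set) -> bool:
    """Record and accept a name only if its normalized form is new."""
    key = normalize(name)
    if key in existing:
        return False
    existing.add(key)
    return True
```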

Challenge 5: Missing Rating Data

Problem: Some entries don't display ratings in search snippets.

Solution: Strict validation - only save entries with complete data:
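
A minimal sketch of the completeness rule, using the field names from the schema table in this document:

```python
# An entry is saved only if both fields were actually extracted.
REQUIRED_FIELDS = ("rating", "total_reviews")

def is_complete(entry: dict) -> bool:
    return all(entry.get(f) for f in REQUIRED_FIELDS)

def keep_complete(entries: list) -> list:
    """Drop entries missing rating or review count."""
    return [e for e in entries if is_complete(e)]
```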

Scraping Results

Results & Statistics

Current Dataset

| Metric | Value |
|--------|-------|
| Total Entries | 157 |
| With Ratings | 157 (100%) |
| With Reviews | 157 (100%) |
| With Deleted Reviews | 58 (37%) |
| With Addresses | 16 (10%) |

Rating Distribution


Category Breakdown

| Category | Count | Percentage |
|----------|-------|------------|
| Doctor | 62 | 39% |
| Dentist | 35 | 22% |
| Clinic | 28 | 18% |
| Medical Practice | 20 | 13% |
| Specialist | 12 | 8% |

Deleted Reviews Analysis


Results Dashboard

Technical Stack

Dependencies

File Structure

Database Structure

Future Enhancements

Planned Improvements

  1. Parallel Processing

    • Use multiple browser contexts simultaneously
    • Reduce total scraping time by 50%
  2. Enhanced Validation

    • Fuzzy name matching for duplicate detection
    • URL-based deduplication as backup
  3. Address Extraction

    • Improve regex patterns for German addresses
    • Parse multiple address formats
  4. Review Content

    • Collect actual review text (with consent)
    • Sentiment analysis on reviews
  5. Monitoring

    • Track changes over time (re-scrape detection)
    • Alert on rating changes

Potential Extensions

Appendix: Code Reference

Main Loop Structure
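
The original loop is not reproduced in this document. A hedged sketch of the two-phase flow described earlier (collect names per term, then scrape each business); the collaborator functions are injected so the structure can be shown without a live browser:

```python
def run(search_terms, fetch_results, scrape_business, save):
    """Phase 1: gather names; Phase 2: scrape each unique name and save it.

    fetch_results(term, page) -> list of names on one result page
    scrape_business(name)     -> entry dict, or None if incomplete
    save(entry)               -> persist one validated entry
    """
    names = []
    for term in search_terms:
        for page in range(3):  # 3 result pages per term
            names.extend(fetch_results(term, page))
    for name in dict.fromkeys(names):  # dedupe while keeping order
        entry = scrape_business(name)
        if entry is not None:
            save(entry)
```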

CSV Export Function
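
The export code is likewise not reproduced; a hedged sketch that dumps all rows of the SQLite table to a CSV file with a header row (the table name `businesses` is an assumption):

```python
import csv
import sqlite3

def export_csv(conn: sqlite3.Connection, path: str) -> int:
    """Write every row of the businesses table to `path`; return row count."""
    cur = conn.execute("SELECT * FROM businesses ORDER BY name")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([c[0] for c in cur.description])  # column headers
        rows = cur.fetchall()
        writer.writerows(rows)
    return len(rows)
```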


Conclusion

This project demonstrates a practical approach to automated medical data collection from web search results. Key learnings include:

  1. Text-based extraction is more robust than CSS selectors for dynamic web pages
  2. Strict validation ensures high data quality even if it means fewer entries
  3. Modular design allows easy maintenance and extension
  4. Automated cleaning catches bad entries that slip through initial filtering
  5. Mathematical validation provides quantifiable data quality metrics

The scraper successfully collects validated medical practitioner data from Bielefeld, with 100% of entries containing ratings and review counts. The architecture is designed for extensibility to other cities and data sources.


Document Version: 1.1
Last Updated: May 11, 2026
License: MIT
