ReviewProof
This project is a Python-based web scraping system designed to collect comprehensive data about local businesses in Bielefeld, Germany - including restaurants, medical practitioners, cafes, shops, and more. The scraper uses Playwright (a modern browser automation library) to interact with Google Search and extract business information including ratings, review counts, addresses, phone numbers, and GDPR-related deletion notices.

Key Objectives:
- Collect data on 500+ businesses across multiple categories
- Capture ratings, review counts, addresses, and deleted review notices
- Maintain data quality through strict validation rules
- Build a reusable, maintainable scraper architecture
- Support multiple business types: restaurants, doctors, shops, cafes
Architecture
System Overview
Component Architecture
Data Pipeline
End-to-End Flow
Phase 1: Search & Collection
The scraper uses a multi-term search strategy to maximize coverage across different business types:
Each term searches 3 pages × 10 results = 30 potential entries per term; with 6 search terms, that is up to 180 potential entries.

Phase 2: Name Extraction
Phase 3: Individual Scraping
Scraping Methodology
Browser Setup
Key configurations:
- headless=False: for visual debugging
- AutomationControlled flag: bypass bot detection
- no-sandbox: required for some Linux environments
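A minimal sketch of this setup using Playwright's sync API. The flag list mirrors the key configurations above; splitting it into a module-level constant is an assumption for illustration, not the project's actual structure.

```python
# Launch arguments matching the key configurations described above.
LAUNCH_ARGS = [
    "--disable-blink-features=AutomationControlled",  # reduce bot-detection signals
    "--no-sandbox",                                   # required in some Linux environments
]

def launch_browser():
    # Imported locally so the constant above is usable without Playwright installed.
    from playwright.sync_api import sync_playwright
    p = sync_playwright().start()
    # headless=False keeps the browser window visible for debugging.
    browser = p.chromium.launch(headless=False, args=LAUNCH_ARGS)
    return p, browser
```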
Search URL Pattern
| Parameter | Value | Purpose |
|-----------|-------|---------|
| q | Search term | Google query |
| tbm=lcl | local | Local business results |
| hl=de-DE | German | German language results |
| start | 0, 10, 20 | Pagination |
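The parameters in the table above can be assembled into a search URL with the standard library, for example:

```python
from urllib.parse import urlencode

def build_search_url(term: str, start: int = 0) -> str:
    """Build a Google local-results URL from the parameters in the table above."""
    params = {
        "q": term,       # search term
        "tbm": "lcl",    # local business results
        "hl": "de-DE",   # German-language results
        "start": start,  # pagination offset: 0, 10, 20
    }
    return "https://www.google.com/search?" + urlencode(params)
```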
Consent Handling
Data Extraction Patterns
Rating Pattern
Address Pattern
Deleted Reviews (GDPR)
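Illustrative regexes for the three patterns above. The exact snippet text Google renders can differ, so treat these as assumptions rather than the project's production patterns; note the German decimal comma in ratings ("4,5") and the "Adresse:" prefix seen in the sample data.

```python
import re

# Rating pattern, e.g. "4,5 (103)": German decimal comma, review count in parentheses.
RATING_RE = re.compile(r"(\d,\d)\s*\((\d+)\)")
# Address pattern, e.g. "Adresse: Hauptstr. 1, 33602 Bielefeld": 5-digit postcode + city.
ADDRESS_RE = re.compile(r"Adresse:\s*(.+?\d{5}\s+\w+)")
# GDPR deletion notice (assumed wording prefix of the German notice).
DELETED_RE = re.compile(r"Einige Ergebnisse wurden.*entfernt")

def extract_rating(text: str):
    """Return (rating, review_count) with the rating normalized to a dot decimal."""
    m = RATING_RE.search(text)
    if not m:
        return None, None
    return m.group(1).replace(",", "."), m.group(2)
```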
Data Validation & Cleaning
Validation Rules
Filtering Math
The name filtering algorithm uses precise constraints: it accepts names between 8 and 70 characters long, uses keyword set operations to ensure only valid business names are included, and applies fuzzy matching with Levenshtein distance to detect and remove duplicate entries.
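A sketch of these constraints. The keyword sets are small illustrative samples, not the project's full lists, and the duplicate check approximates Levenshtein distance with `difflib`'s similarity ratio:

```python
from difflib import SequenceMatcher

# Illustrative subsets of the keyword lists described in this section.
BUSINESS_KEYWORDS = {"praxis", "dr.", "zahnarzt", "klinik", "restaurant", "café"}
REJECT_KEYWORDS = {"apotheke", "optiker", "fitnessstudio"}

def is_valid_name(name: str) -> bool:
    if not 8 <= len(name) <= 70:              # length constraint from the text
        return False
    tokens = set(name.lower().split())
    if tokens & REJECT_KEYWORDS:              # any rejection keyword disqualifies
        return False
    return bool(tokens & BUSINESS_KEYWORDS)   # must contain a business keyword

def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    # Similarity ratio as a stand-in for normalized Levenshtein distance.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
```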
Business Keywords (Must Match)
The filtering system supports multiple business categories:
Rejection Patterns
Clean Categories
The system auto-categorizes businesses into:
Medical:
- Doctors (Dr., Dr. med., Prof.)
- Practices (Praxis, Gemeinschaftspraxis)
- Clinics (Klinik, Krankenhaus)
- Dentists (Zahnarzt, Zahnärzte)
- Specialists (Facharzt, Centrum)
- Therapy centers
Restaurants & Food:
- Restaurants (Restaurant, Gastronomie)
- Cafes (Café, Konditorei)
- Fast food (Pizza, Döner, Grill)
- Bakeries (Bäckerei)
Retail & Services:
- Shops and stores
- Electronics (Elektro, Elektronik)
- Fashion (Mode, Boutique)
- Markets (Markt)

Statistical Formulas
Data quality is measured using statistical metrics:
- Mean rating: mean = SUM(x) / n
- Standard deviation: std = sqrt(SUM((x - mean)^2) / n)
- Data completeness rates:
  - Rating completeness: R = count(rating) / total * 100%
  - Address completeness: A = count(address) / total * 100%
  - Deleted reviews: D = count(deleted) / total * 100%
- Rating histogram bins:

| Bin | Range | Count |
|-----|-------|-------|
| 1 | 4.5-5.0 | 89 |
| 2 | 4.0-4.4 | 45 |
| 3 | 3.5-3.9 | 18 |
| 4 | under 3.5 | 5 |

- Category distribution percentages:
  - Doctor: 62/157 = 39.5%
  - Dentist: 35/157 = 22.3%
  - Clinic: 28/157 = 17.8%
  - Medical Practice: 20/157 = 12.7%
  - Specialist: 12/157 = 7.6%
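The mean, population standard deviation, and completeness formulas above can be computed in a few lines; this sketch treats `None` as a missing value and assumes at least one value is present:

```python
import math

def dataset_stats(ratings):
    """Mean, population standard deviation, and completeness (%), per the
    formulas above. `None` entries count as missing for completeness."""
    n = len(ratings)
    present = [r for r in ratings if r is not None]
    mean = sum(present) / len(present)
    std = math.sqrt(sum((x - mean) ** 2 for x in present) / len(present))
    completeness = len(present) / n * 100  # e.g. rating completeness R
    return mean, std, completeness
```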
Timing Calculations
Performance timing is calculated using:
- Page load time estimates:
  - Average load: t_load = 1.8s +/- 0.5s
  - Timeout threshold: t_max = 10s
- Random delay distribution: U(1.0, 2.0) seconds between requests (mean delay 1.5s)
- Total runtime formula (per-entry load + extract + save, plus one delay between consecutive entries):
  For 157 entries: T = 157 * (1.8 + 0.3 + 0.1) + 156 * 1.5 = 345.4 + 234 ≈ 579s ≈ 9.7 minutes
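The runtime formula is easy to parameterize; the per-entry component names below (load, extract, save) are assumptions about what the three constants stand for, and the mean delay is the expectation of U(1.0, 2.0):

```python
def estimated_runtime(n_entries, t_load=1.8, t_extract=0.3, t_save=0.1, mean_delay=1.5):
    """Estimated total runtime in seconds: per-entry work for every entry,
    plus one inter-request delay between each pair of consecutive entries."""
    return n_entries * (t_load + t_extract + t_save) + (n_entries - 1) * mean_delay
```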
Performance Analysis
| Operation | Time Complexity | Description |
|-----------|----------------|-------------|
| Name filtering | O(n) | Linear scan of token set |
| Fuzzy matching | O(n * m) | Levenshtein on all pairs |
| Rating extraction | O(1) | Single regex match |
| DB insert/update | O(1) | Hash table lookup |
| CSV export | O(n) | Full table scan |

- Memory estimation:
  - Raw HTML buffer: ~5MB per page
  - Name set: ~50KB for 1000 names
  - SQLite DB: ~500KB for 157 entries
  - Total peak: ~10MB
- Throughput calculation: entries scraped per unit time, i.e. throughput = n / T
Advanced Calculations
- Haversine distance (for future map visualization)
- Google Maps coordinate extraction
- Quality scoring algorithm (0-100)
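For the planned map visualization, the standard haversine formula gives the great-circle distance between two coordinates; this is a textbook implementation, not code from the project:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))
```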
Database Schema
SQLite Schema
Entity Relationship
Field Descriptions
| Field | Type | Description | Example |
|-------|------|-------------|---------|
| id | INTEGER | Primary key | 1 |
| name | TEXT | Business name | "Dr. med. Hans Müller" |
| rating | TEXT | Rating (1.0-5.0) | "4.5" |
| total_reviews | TEXT | Number of reviews | "103" |
| deleted_reviews | TEXT | GDPR deletion notice | "Einige Ergebnisse..." |
| address | TEXT | Full address | "Adresse: Hauptstr. 1, 33602 Bielefeld" |
| url | TEXT | Google Maps URL | "https://www.google.com/maps/..." |
| category | TEXT | Auto-tagged category | "Doctor", "Dentist", "Clinic" |
| scrape_date | TEXT | First scrape date | "2026-05-11" |
| last_updated | TEXT | Last update date | "2026-05-11" |
| status | TEXT | Active/inactive | "active" |
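The fields above correspond to a table definition along these lines. The table name `businesses` and the UNIQUE constraint on `name` are assumptions (the latter inferred from the name-based deduplication described later):

```python
import sqlite3

# Schema sketch reconstructed from the field table above; names are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS businesses (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    rating TEXT,
    total_reviews TEXT,
    deleted_reviews TEXT,
    address TEXT,
    url TEXT,
    category TEXT,
    scrape_date TEXT,
    last_updated TEXT,
    status TEXT DEFAULT 'active'
)
"""

conn = sqlite3.connect(":memory:")  # in-memory DB for illustration
conn.execute(SCHEMA)
```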
Category Auto-Tagging
Data Flow Diagrams
Complete Pipeline
Error Handling Flow
Challenges & Solutions
Challenge 1: Dynamic Google UI
Problem: Google frequently changes their HTML structure and CSS class names.
Solution: Use text-based extraction instead of CSS selectors:
Before (Selector-based):
After (Text-based):
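A hypothetical contrast between the two approaches. The CSS class name below is invented; that fragility is exactly the problem the text-based approach avoids:

```python
import re

def rating_via_selector(page):
    # Before: breaks whenever Google renames the class (class name is made up).
    return page.query_selector("span.xyzAbc").inner_text()

def rating_via_text(page_text: str):
    # After: match the visible pattern "4,5 (103)" anywhere in the page text.
    m = re.search(r"(\d,\d)\s*\((\d+)\)", page_text)
    return m.group(1) if m else None
```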
Challenge 2: Consent Cookie Banners
Problem: Google shows cookie consent overlay that blocks content.
Solution: Automated button clicking with multiple attempts:
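A sketch of the multi-attempt consent click with Playwright's `:has-text()` selector. The button labels are common German Google consent texts and may change at any time:

```python
# Assumed consent-button labels; Google may change these.
CONSENT_LABELS = ["Alle akzeptieren", "Alle ablehnen", "Ich stimme zu"]

def dismiss_consent(page, attempts: int = 3) -> bool:
    """Try each known label, waiting 1s between rounds; True if a button was clicked."""
    for _ in range(attempts):
        for label in CONSENT_LABELS:
            button = page.query_selector(f'button:has-text("{label}")')
            if button:
                button.click()
                return True
        page.wait_for_timeout(1000)  # wait 1s, then retry
    return False
```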
Challenge 3: Non-Medical Entries
Problem: Search results include gyms, pharmacies, opticians.
Solution: Strict filtering with keyword validation:
Challenge 4: Duplicate Entries
Problem: Same clinic appears under different names (e.g., "EvKB Haus Gilead I" vs "EvKB Haus Gilead IV").
Solution: Database-level dedup with name matching:
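Database-level dedup can be as simple as a uniqueness constraint on the name plus `INSERT OR IGNORE`; the table and column names here are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE businesses (name TEXT PRIMARY KEY, rating TEXT)")

def upsert(name: str, rating: str) -> bool:
    """Insert a new row; returns False when the name already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO businesses (name, rating) VALUES (?, ?)",
        (name, rating),
    )
    return cur.rowcount == 1  # 0 rows affected means the insert was ignored
```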
Challenge 5: Missing Rating Data
Problem: Some entries don't display ratings in search snippets.
Solution: Strict validation - only save entries with complete data:
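The strict-validation rule reduces to a completeness check before saving. Which fields count as required is an assumption here, based on the 100% rating/review completeness reported later:

```python
# Assumed required fields; address is optional per the reported completeness rates.
REQUIRED_FIELDS = ("name", "rating", "total_reviews")

def is_complete(entry: dict) -> bool:
    """Only entries with every required field present and non-empty are saved."""
    return all(entry.get(field) for field in REQUIRED_FIELDS)
```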

Results & Statistics
Current Dataset
| Metric | Value |
|--------|-------|
| Total Entries | 157 |
| With Ratings | 157 (100%) |
| With Reviews | 157 (100%) |
| With Deleted Reviews | 58 (37%) |
| With Addresses | 16 (10%) |
Rating Distribution
Category Breakdown
| Category | Count | Percentage |
|----------|-------|------------|
| Doctor | 62 | 39% |
| Dentist | 35 | 22% |
| Clinic | 28 | 18% |
| Medical Practice | 20 | 13% |
| Specialist | 12 | 8% |
Deleted Reviews Analysis

Technical Stack
Dependencies
File Structure

Future Enhancements
Planned Improvements
- Parallel Processing
  - Use multiple browser contexts simultaneously
  - Reduce total scraping time by 50%
- Enhanced Validation
  - Fuzzy name matching for duplicate detection
  - URL-based deduplication as backup
- Address Extraction
  - Improve regex patterns for German addresses
  - Parse multiple address formats
- Review Content
  - Collect actual review text (with consent)
  - Sentiment analysis on reviews
- Monitoring
  - Track changes over time (re-scrape detection)
  - Alert on rating changes
Potential Extensions
Appendix: Code Reference
Main Loop Structure
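A high-level sketch of the three-phase pipeline (search, name extraction, individual scraping). The callables stand in for the real helpers, whose names and signatures are not shown in this document:

```python
def run_scraper(search_terms, fetch, extract, scrape, save, pages_per_term=3):
    """Pipeline sketch: fetch/extract/scrape/save are hypothetical stand-ins."""
    names = set()
    for term in search_terms:                    # Phase 1: search & collection
        for page_index in range(pages_per_term):
            html = fetch(term, page_index * 10)  # start = 0, 10, 20
            names.update(extract(html))          # Phase 2: name extraction
    saved = 0
    for name in sorted(names):                   # Phase 3: individual scraping
        entry = scrape(name)
        if entry is not None:                    # strict validation: skip incomplete
            save(entry)
            saved += 1
    return saved
```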
CSV Export Function
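A sketch of the CSV export: dump every row of the SQLite table with a header derived from the cursor description. The table name `businesses` is an assumption:

```python
import csv
import sqlite3

def export_csv(conn: sqlite3.Connection, path: str) -> int:
    """Write all rows of the businesses table to a CSV file; returns row count."""
    cur = conn.execute("SELECT * FROM businesses")
    headers = [col[0] for col in cur.description]
    rows = cur.fetchall()
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)
```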
Conclusion
This project demonstrates a practical approach to automated medical data collection from web search results. Key learnings include:
- Text-based extraction is more robust than CSS selectors for dynamic web pages
- Strict validation ensures high data quality even if it means fewer entries
- Modular design allows easy maintenance and extension
- Automated cleaning catches bad entries that slip through initial filtering
- Mathematical validation provides quantifiable data quality metrics
The scraper successfully collects validated medical practitioner data from Bielefeld, with 100% of entries containing ratings and review counts. The architecture is designed for extensibility to other cities and data sources.
Document Version: 1.1
Last Updated: May 11, 2026
License: MIT