
Extract blood test data from PDF documents that have been OCR'd
£210 (approx. $282)
- Posted:
- Proposals: 18
- Remote
- #4477753
- OPPORTUNITY
- Open for Proposals
Description
Experience Level: Intermediate
Estimated project duration: 1 day or less
The objective is to build a structured blood test database that allows pathology results to be viewed, edited, filtered, and exported to Excel via a web-based HTML interface. The system stores results in a clean, standardised format so trends can be analysed accurately over time.
Using AI-assisted OCR, I have built a local Python extraction pipeline that converts PDF pathology reports into machine-readable text and inserts structured data into a SQLite database. The majority of blood tests extract correctly, including canonical test name, result value, unit, and reference range.
However, I have reached a specific technical issue with three markers:
• CRP (C-reactive protein)
• ESR
• GLU (Glucose)
The OCR output clearly contains the correct lines, and debug logs confirm they are processed. Yet no rows are inserted for these markers.
The failure appears to occur somewhere in canonical matching, numeric extraction, or validation logic.
Current System Architecture
The system runs locally and consists of:
• extraction_core_2.py (main engine)
• Supporting modules for OCR preprocessing, lab dictionary building, regex matching, and validation
• SQLite backend
• Schema-driven canonical lab dictionary
• Controlled fuzzy fallback logic
• HTML viewer for results display and Excel export
Pipeline flow:
1. Convert PDF to image (pdf2image)
2. Preprocess
3. Run Tesseract OCR
4. Clean and normalise text
5. Match against canonical lab dictionary
6. Extract:
• canonical test name
• numeric result
• unit
• reference range
7. Validate
8. Insert into SQLite
The engine is deterministic and rule-based.
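For readers unfamiliar with this kind of pipeline, the text-side stages (clean, match, extract, validate, insert) can be sketched roughly as follows. This is a minimal illustration only: the alias table, the regex, the table schema, and the function names are all assumptions and are not taken from extraction_core_2.py. The PDF-to-image and OCR steps are left out.

```python
import re
import sqlite3

# Hypothetical alias table: normalised alias -> canonical test name.
CANONICAL = {"crp": "CRP", "esr": "ESR", "glu": "GLU", "glucose": "GLU"}

# Hypothetical line layout: name, optional H/L flag, value, unit, low-high range.
LINE_RE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z \-]*?)\s+"
    r"(?:(?P<flag>[HL])\s+)?"
    r"(?P<value>\d+(?:\.\d+)?)\s+"
    r"(?P<unit>\S+)\s+"
    r"(?P<low>\d+(?:\.\d+)?)\s*-\s*(?P<high>\d+(?:\.\d+)?)$"
)

def parse_line(raw):
    """Clean, match, and extract one OCR line; return a row tuple or None."""
    line = " ".join(raw.split())          # collapse runs of whitespace
    m = LINE_RE.match(line)
    if m is None:
        return None
    canonical = CANONICAL.get(m["name"].lower())
    if canonical is None:
        return None
    low, high = float(m["low"]), float(m["high"])
    if low > high:                        # validation: range must be ordered
        return None
    return (canonical, float(m["value"]), m["unit"], low, high, m["flag"])

def insert_lines(conn, lines):
    """Insert every line that parses; return the number of rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(test TEXT, value REAL, unit TEXT, ref_low REAL, ref_high REAL, flag TEXT)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
inserted = insert_lines(conn, ["CRP H 5.2 mg/L 0-5", "not a result line"])
```

Every step here is a plain rule with no randomness, so the same input text always yields the same rows.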
The Specific Problem
Example OCR line:
CRP H 5.2 mg/L 0-5
OCR text is correct.
NUMBER_PATTERN matches.
The canonical dictionary contains the test.
Yet:
Inserted 0 rows from 0126251OrderReport_23B00006604_CRP.pdf
Likely failure points include:
• Canonical containment match failing due to normalisation
• Flag tokens (“H”, “L”) interfering with numeric capture
• Numeric extraction anchored incorrectly
• Validation rejecting due to strict range formatting
• Unit pattern mismatch (e.g. mmol/L)
• Dictionary indexing issue
• Match overridden by another lab name
• Guard conditions too strict
If validation fails, the row is rejected silently.
All other panels extract correctly. The issue appears isolated.
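One of the failure points listed above, a flag token sitting between the test name and the number, can be reproduced in isolation. The two patterns below are illustrative assumptions, not the patterns used in extraction_core_2.py: the first expects the number directly after the name, the second tolerates a single optional H or L flag.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# Hypothetical strict pattern: the number must follow the test name directly.
STRICT = re.compile(rf"^(?P<name>[A-Z]+)\s+(?P<value>{NUMBER})\b")

# Hypothetical flag-aware pattern: one optional H/L token may precede the number.
FLAG_AWARE = re.compile(
    rf"^(?P<name>[A-Z]+)\s+(?:(?P<flag>[HL])\s+)?(?P<value>{NUMBER})\b"
)

line = "CRP H 5.2 mg/L 0-5"

strict_match = STRICT.match(line)      # None: "H" sits where a digit is expected
flag_match = FLAG_AWARE.match(line)    # matches: flag="H", value="5.2"

print(strict_match)
print(flag_match.group("name"), flag_match.group("flag"), flag_match.group("value"))
```

If the real numeric pattern is anchored in a similar way, NUMBER_PATTERN could still match somewhere on the line while the anchored capture fails, which would be consistent with the symptoms described.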
What Is Required
This is not a rebuild.
We do not want:
• Re-architecture
• Experimental AI guessing logic
• Large-scale changes
• Expanded fuzzy matching
We need:
1. Precise Diagnosis
Identify exactly where CRP, ESR, and GLU are failing insertion and which rule is causing rejection.
2. Minimal Safe Fix
Implement a targeted correction that:
• Adjusts canonical matching if required
• Anchors numeric extraction correctly
• Allows flag tokens without blocking capture
• Relaxes only necessary validation checks
• Preserves deterministic behaviour
3. Zero Regression
• No impact to currently working panels
• No performance degradation
• No uncontrolled fuzzy expansion
4. Modular Implementation
If appropriate, either:
• Implement the fix as a small isolated module
or
• Cleanly adjust the existing matching block
The existing architecture should remain intact.
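A diagnosis of this kind usually starts by making the silent rejection visible. The sketch below runs a line through an ordered list of stages and reports the first stage that returns nothing. The `trace` helper and the two demo stages are hypothetical stand-ins, assuming each real stage can be wrapped as a function that returns None on rejection.

```python
import re
from typing import Any, Callable, Optional, Sequence, Tuple

Stage = Tuple[str, Callable[[Any], Optional[Any]]]

def trace(line: str, stages: Sequence[Stage]) -> Tuple[str, Any]:
    """Run stages in order; return ("ok", result) or (failing stage, last good value)."""
    value: Any = line
    for name, fn in stages:
        result = fn(value)
        if result is None:
            return name, value
        value = result
    return "ok", value

# Demo stages (assumptions, deliberately simple).
def match_name(line):
    parts = line.split(maxsplit=1)
    if len(parts) == 2 and parts[0] in {"CRP", "ESR", "GLU"}:
        return parts[0], parts[1]
    return None

def extract_number(pair):
    name, rest = pair
    m = re.match(r"\d+(?:\.\d+)?", rest)   # anchored at the start of the remainder
    return (name, float(m.group())) if m else None

STAGES = [("canonical_match", match_name), ("numeric_extraction", extract_number)]

flagged = trace("CRP H 5.2 mg/L 0-5", STAGES)
unflagged = trace("CRP 5.2 mg/L 0-5", STAGES)
print(flagged)
print(unflagged)
```

Running the same trace over lines from panels that already work, before and after a fix, and comparing the outputs is one simple way to check for regressions.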
Constraints
The system is designed to be:
• Deterministic
• Schema-driven
• Reproducible
• Forensic-grade
We cannot introduce probabilistic or unpredictable behaviour.
Longer-Term Goal
After stabilising extraction:
• Migrate to web deployment
• Enable structured uploads
• Add trend analysis
• Later incorporate AI-assisted interpretation
Immediate priority:
Stabilise deterministic extraction for CRP, ESR, and GLU without breaking the existing engine.
Materials Provided
Uploaded:
• Full extraction_core_2.py (text format)
• Screenshot of HTML viewer
• Sample PDF files
• Export showing required output
Additional materials available on request:
• Sample OCR blocks
• Canonical dictionary entries
• Regex patterns
• Validation logic
• Database schema
• Debug logs
This is a focused debugging and refinement request. I have spent many hours attempting to isolate the issue and now require an experienced developer to identify the blocking condition and implement a practical fix.
I have been advised this should take 1–2 hours for a senior developer.
Looking for a swift turnaround.
Gill A.
100% (4)
- Projects Completed: 6
- Freelancers worked with: 5
- Projects awarded: 36%
- Last project: 7 Oct 2024
- United Kingdom
Clarification Board
-

Hello Gill,
We also need the Python modules/files for all enhancements used in the main script.

Hello Gill,
The extraction_core_2.py file requires ocr_preprocessor_1.py to run. Could you please share this file?
I also need some sample PDFs to test my solution.
Thank you.
Thanks -

Have you considered using Docling for PDF conversion into Markdown before processing? https://www.docling.ai/
-

Hi Gill,
I have set up a fix, but the system depends on ocr_preprocessor_1.py, which is missing from your attached file.
Could you please share the ocr_preprocessor_1.py?
Thanks
Sumit
SaS Technologies