7x Better Than Google, Amazon, and Microsoft
Built a custom OCR and page segmentation engine that turned weeks of manual investigation into minutes — processing historical documents spanning over a century.
Client Type
Nonprofit (Child Safety)
Timeline
TBC
Role
Lead AI Engineer
A child safety organization was tracing decades of personnel movements across thousands of historical documents, entirely by hand.
The client investigates institutional abuse spanning over a century. Their caseworkers were manually reading through historical organizational directories — documents with degraded print, inconsistent layouts, and no digital structure.
Tracing a single individual's movements across postings took an investigator roughly two weeks. With thousands of cases to process, the bottleneck wasn't willpower; it was the sheer volume of reading. No commercial OCR solution could handle the document quality. Google's, Amazon's, and Microsoft's OCR services all failed on the degraded historical print, producing unusable output.
7x
More accurate than Google, Amazon, and Microsoft OCR
2 min
To trace a subject’s full movement history (previously 2 weeks)
100+
Years of historical documents processed
Custom Page Segmentation
Built a morphology-based page segmentation engine to handle the inconsistent layouts of historical directories. Unlike off-the-shelf tools, this could parse multi-column, degraded pages with varying fonts and print quality across different decades.
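To make the idea concrete, here is a minimal pure-Python sketch of one morphological step of the kind such an engine might use. It is an illustration, not the production code. A horizontal binary dilation smears characters together so that narrow gaps inside a text column close up, and a vertical projection then finds the wide gutters between columns. The names `dilate_horizontal` and `find_columns`, the `radius` parameter, and the toy page are all assumptions made for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink, 0 = background; assumed non-empty and rectangular


def dilate_horizontal(img: Image, radius: int) -> Image:
    """Binary dilation with a 1 x (2*radius + 1) structuring element."""
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if img[y][x]:
                # Spread each ink pixel to its horizontal neighbours.
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[y][xx] = 1
    return out


def find_columns(img: Image, radius: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, end exclusive."""
    smeared = dilate_horizontal(img, radius)
    width = len(smeared[0])
    # Vertical projection: does any row have ink at this x position?
    has_ink = [any(row[x] for row in smeared) for x in range(width)]
    columns: List[Tuple[int, int]] = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            columns.append((start, x))
            start = None
    if start is not None:
        columns.append((start, width))
    return columns


# Toy two-column page: '#' is ink, '.' is background (hypothetical data).
page = [
    "##.##......###.#",
    "#..#.......#..##",
]
img = [[1 if c == "#" else 0 for c in row] for row in page]

without_dilation = find_columns(img, radius=0)  # narrow gap splits the left column
with_dilation = find_columns(img, radius=1)     # dilation closes the narrow gap
```

Without dilation, the one-pixel gap inside the left column is mistaken for a column boundary. With a radius of 1, only the wide gutter survives. A real engine would work on scanned images with an image-processing library and tune the structuring element per document era.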
Custom DNN-Based OCR
Designed and trained a deep neural network combining convolutional (CNN) and LSTM layers with a CTC (connectionist temporal classification) output stage, built specifically for degraded historical print. This wasn’t fine-tuning an existing model; it was a ground-up build optimized for document types that commercial solutions couldn’t read.
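The network itself is too large to show here, but the CTC part can be illustrated in a few lines. In a CTC-based recognizer, the model emits one score vector per horizontal slice of a text line, with an extra "blank" class. Decoding collapses repeated labels and then drops blanks. The sketch below uses greedy (best-path) decoding, which is an assumption: the source does not say which decoder the project used. The function name, the blank index, and the toy scores are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    chars: List[str] = []
    previous = BLANK
    for label in best_path:
        # Emit a character only when the label changes and is not blank.
        if label != BLANK and label != previous:
            chars.append(alphabet[label - 1])  # labels are offset by the blank
        previous = label
    return "".join(chars)


# Toy per-frame scores over [blank, 'a', 'l'] (hypothetical values).
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a again, collapsed into the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # l
    [0.60, 0.10, 0.30],  # blank separates the double letter
    [0.10, 0.20, 0.70],  # l
]
decoded = ctc_greedy_decode(frames, alphabet="al")
```

The blank between the two `l` frames is what lets the decoder output a genuine double letter instead of collapsing it, which is why CTC suits text lines where character widths vary and no per-character alignment is available.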
Pattern Extraction Pipeline
Built an automated extraction layer that could identify individuals and map their movements across postings, locations, and dates spanning decades. This turned raw OCR output into structured, searchable intelligence.
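As a rough sketch of what such an extraction layer does, the example below parses OCR text from directory pages into records and groups them into a per-person timeline. Everything specific here is an assumption for illustration: the "Surname, Given; Location" line format, the `ENTRY` pattern, the `build_timelines` function, and the placeholder names. Real directories vary by decade, and a production pipeline would also need fuzzy matching to cope with OCR errors in names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical entry format: "Surname, Given names; Location"
ENTRY = re.compile(
    r"^\s*(?P<surname>[A-Za-z'\-]+),\s*(?P<given>[A-Za-z. ]+?)\s*;\s*(?P<location>.+?)\s*$"
)

Timeline = List[Tuple[int, str]]  # (year, location), sorted by year


def build_timelines(pages: Iterable[Tuple[int, str]]) -> Dict[str, Timeline]:
    """Group parsed entries by person; each page is (directory year, OCR text)."""
    timelines: Dict[str, Timeline] = defaultdict(list)
    for year, text in pages:
        for line in text.splitlines():
            match = ENTRY.match(line)
            if match is None:
                continue  # skip lines that do not look like an entry
            person = f"{match['surname'].upper()}, {match['given'].strip()}"
            timelines[person].append((year, match["location"]))
    return {person: sorted(postings) for person, postings in timelines.items()}


# Placeholder data, deliberately out of chronological order.
pages = [
    (1934, "Doe, A. B.; Eastport\nRoe, C.; Northfield"),
    (1931, "Doe, A. B.; Northfield\n-- illegible --"),
]
timelines = build_timelines(pages)
```

Here `timelines["DOE, A. B."]` lists the Northfield posting in 1931 followed by Eastport in 1934, which is the kind of structured movement history an investigator can then search and compare across cases.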
Investigator Workflow Integration
Delivered a tool that caseworkers could use directly — reducing a two-week manual investigation to under two minutes. This also enabled pattern extrapolation across multiple cases, something that was previously impossible to do manually at scale.
BEFORE
- Commercial OCR (Google, Amazon, Microsoft) produced unusable output on historical documents
- Two weeks per subject to manually trace movement history
- Pattern analysis across multiple cases was practically impossible
- Investigators bottlenecked on document review, not actual investigation
AFTER
- Custom OCR outperformed all three major cloud providers by 7x on accuracy
- Full subject movement history extracted in under 2 minutes
- Cross-case pattern extrapolation automated for the first time
- Project featured at the United Nations Convention in Geneva
Have a similar challenge?
I help teams design, build, and ship AI systems that work in production. Let's talk about your problem.