COMPUTER VISION

7x Better Than Google, Amazon, and Microsoft

Built a custom OCR and page segmentation engine that turned weeks of manual investigation into minutes — processing historical documents spanning over a century.

Client Type

Nonprofit (Child Safety)

Timeline

TBC

Role

Lead AI Engineer

— THE CHALLENGE

A child safety organization needed to trace decades of personnel movements across thousands of historical documents — by hand.

The client investigates institutional abuse spanning over a century. Their caseworkers were manually reading through historical organizational directories — documents with degraded print, inconsistent layouts, and no digital structure.

Tracing a single individual's movements across postings took an investigator roughly two weeks. With thousands of cases to process, the bottleneck wasn't effort; it was the sheer volume of degraded pages that had to be read by hand. No commercial OCR service could handle the document quality: the OCR offerings from Google, Amazon, and Microsoft all produced unusable output on the degraded historical print.

7x

More accurate than Google, Amazon, and Microsoft OCR

2 min

To trace a subject’s full movement history (previously 2 weeks)

100+

Years of historical documents processed

— THE APPROACH

01

Custom Page Segmentation

Built a morphology-based page segmentation engine to handle the inconsistent layouts of historical directories. Unlike off-the-shelf tools, this could parse multi-column, degraded pages with varying fonts and print quality across different decades.
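To give a feel for what morphology-based segmentation means, here is a minimal sketch of one common building block: smear the ink horizontally with a binary dilation so that small gaps between characters close up, then split the page wherever a vertical strip still contains no ink. It is an illustration only, not the engine built for this project. The function names, the `radius` parameter, and the toy page are all assumptions made for the example.

```python
def to_binary(rows):
    """Turn rows of '#' (ink) and '.' (paper) into a 0/1 grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def dilate_horizontal(page, radius):
    """Binary dilation with a 1 x (2 * radius + 1) structuring element."""
    height, width = len(page), len(page[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            lo, hi = max(0, x - radius), min(width, x + radius + 1)
            if any(page[y][lo:hi]):
                out[y][x] = 1
    return out


def find_columns(page, radius=1):
    """Return (start, end) x-ranges of text columns, end exclusive.

    Ranges are measured on the dilated page, so each one is padded
    by up to `radius` pixels on either side.
    """
    smeared = dilate_horizontal(page, radius)
    width = len(smeared[0])
    has_ink = [any(row[x] for row in smeared) for x in range(width)]
    columns, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a column begins
        elif not ink and start is not None:
            columns.append((start, x))     # a gutter ends the column
            start = None
    if start is not None:
        columns.append((start, width))
    return columns


# Toy two-column page: each column has a 1-pixel gap, the gutter is 5 wide.
page = to_binary([
    "##.##.....##.##",
    "#..##.....##..#",
])
print(find_columns(page, radius=0))  # gaps inside columns cause 4 fragments
print(find_columns(page, radius=1))  # dilation bridges them: 2 columns
```

The design choice worth noting is the size of the structuring element: it has to be wider than the gaps inside a column but narrower than the gutter between columns. On scans from different decades that threshold would likely need to vary.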

02

Custom DNN-Based OCR

Designed and trained a deep neural network specifically for degraded historical print, combining convolutional (CNN) and recurrent (LSTM) layers and trained with a connectionist temporal classification (CTC) loss. This wasn’t fine-tuning an existing model; it was a ground-up build optimized for document types that commercial solutions couldn’t read.
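In a CTC-based recognizer, the network emits one score vector per narrow slice of a text line, and a decoding step turns those per-slice scores into a string. The sketch below shows the simplest such step, greedy CTC decoding: take the best class per slice, collapse consecutive repeats, and drop the blank symbol. It does not reproduce the project's model. The alphabet, the blank index, and the `one_hot` helper are hypothetical stand-ins for real network output.

```python
# Index 0 is reserved for the CTC blank; the rest is a toy alphabet.
ALPHABET = "_abcdefghijklmnopqrstuvwxyz"
BLANK = 0


def one_hot(index, size=len(ALPHABET)):
    """Fake a network output frame that is certain about one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def ctc_greedy_decode(frame_scores, alphabet=ALPHABET, blank=BLANK):
    """Best class per frame, then collapse repeats and remove blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    chars = []
    previous = blank
    for index in best:
        # Emit only when the class changes and is not the blank.
        if index != blank and index != previous:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)


# Frames read: a a _ d _ d d. The blank between the two d's keeps both.
frames = [one_hot(i) for i in [1, 1, 0, 4, 0, 4, 4]]
print(ctc_greedy_decode(frames))  # add
```

The blank symbol is what lets the output contain doubled letters: without a blank between them, two adjacent identical predictions collapse into one character.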

03

Pattern Extraction Pipeline

Built an automated extraction layer that could identify individuals and map their movements across postings, locations, and dates spanning decades. This turned raw OCR output into structured, searchable intelligence.
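As a rough picture of what such an extraction layer does, the sketch below parses OCR'd directory lines into (person, posting) records and merges them across yearly editions into one timeline per person. Everything specific here is assumed for illustration: the `Surname, Given; Posting` line format, the `ENTRY` pattern, the function name, and the placeholder names and places. Real directories would need layout-specific parsing and tolerant name matching to cope with OCR errors.

```python
import re
from collections import defaultdict

# Hypothetical line format: "Surname, Given; Posting"
ENTRY = re.compile(
    r"^(?P<surname>[^,;]+),\s*(?P<given>[^;]+);\s*(?P<posting>.+)$"
)


def build_timelines(editions):
    """Map each person to a chronological list of (year, posting).

    `editions` maps a directory year to its OCR'd lines. Only changes
    of posting are recorded, so the list reads as a movement history.
    """
    timelines = defaultdict(list)
    for year in sorted(editions):
        for line in editions[year]:
            match = ENTRY.match(line.strip())
            if match is None:
                continue  # skip lines the pattern cannot parse
            person = (match["surname"].strip(), match["given"].strip())
            posting = match["posting"].strip()
            history = timelines[person]
            if not history or history[-1][1] != posting:
                history.append((year, posting))
    return dict(timelines)


# Placeholder data; years are deliberately out of order.
editions = {
    1931: ["Doe, Jane; Branch B, Southtown", "Roe, Sam; Branch C, Westport"],
    1925: ["Doe, Jane; Branch A, Northtown"],
    1932: ["Doe, Jane; Branch B, Southtown", "### smudged ###"],
}
for person, history in build_timelines(editions).items():
    print(person, history)
```

Once every edition is reduced to records like these, tracing one person becomes a dictionary lookup rather than a page-by-page read, and comparing histories across people becomes an ordinary query.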

04

Investigator Workflow Integration

Delivered a tool that caseworkers could use directly — reducing a two-week manual investigation to under two minutes. This also enabled pattern extrapolation across multiple cases, something that was previously impossible to do manually at scale.

— THE RESULTS

BEFORE

  • Commercial OCR (Google, Amazon, Microsoft) produced unusable output on historical documents
  • Two weeks per subject to manually trace movement history
  • Pattern analysis across multiple cases was practically impossible
  • Investigators bottlenecked on document review, not actual investigation

AFTER

  • Custom OCR outperformed all three major cloud providers by 7x on accuracy
  • Full subject movement history extracted in under 2 minutes
  • Cross-case pattern extrapolation automated for the first time
  • Project featured at the United Nations Convention in Geneva

Have a similar challenge?

I help teams design, build, and ship AI systems that work in production. Let's talk about your problem.