OCR Automations to Map Your PDF Data
There is an old joke that PDF is where data goes to die. While PDF is fine for basic purposes, data trapped in a PDF and out of your reach is not helping your business succeed in challenging times.
Business documents with an unrealized spatial component (hint: most of them) abound in PDF format. When a governmental organization issues a deed, lease, or other legal title instrument, paper copies are produced (ironically, often from a digital original), then re-scanned and filed in a document storage system, which is sometimes just directories full of PDFs. Many of these PDFs are lost to business processes because, without reasonable naming conventions, they are difficult to retrieve. Addressing that topic alone could fill volumes. Other candidate document sources include work orders, invoices, and receipts.
Optical Character Recognition (OCR) technology, which turns scanned documents into text, has been around for a long time. By itself, it typically yields piles of unstructured, jumbled text. PDFs containing mapping-related special characters, such as those in coordinates, bearings, and distances, generally do not translate well with unguided OCR engines such as those commonly found in the consumer-grade market.
What if software could be trained to extract legal descriptions in a more structured way and pass them to GIS software for immediate mapping? In addition, why not simultaneously extract other desirable information from the scans, including vendors, amounts, dates, rents, contract terms, and stipulations, and store it in a database linked to the scan or the GIS database? Better data reduces risk and enhances business opportunities.
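To make the idea concrete, here is a minimal sketch of the step that turns OCR output from a legal description into mappable coordinates. It assumes the description uses quadrant bearings and distances in feet (for example, N 45°30'15" E 100.00 feet), and it matches the degree, minute, and second symbols loosely because unguided OCR often garbles them. The names `parse_calls`, `traverse`, and `to_wkt_linestring` are hypothetical and chosen for illustration only; a production system would also need to handle curves, other units, and a real coordinate reference system.

```python
import math
import re

# Tolerant pattern for quadrant-bearing calls such as: N 45°30'15" E 100.00 feet
# The degree sign may come out of OCR as "º", "o", or "*", so all are accepted.
CALL_RE = re.compile(
    r"""\b(?P<ns>[NS])\s*
        (?P<deg>\d{1,2})\s*[°ºo*]\s*
        (?:(?P<min>\d{1,2})\s*['’′]\s*)?
        (?:(?P<sec>\d{1,2}(?:\.\d+)?)\s*["”″]\s*)?
        (?P<ew>[EW])\s*,?\s*
        (?:a\s+distance\s+of\s+)?
        (?P<dist>\d+(?:\.\d+)?)\s*(?:feet|ft)""",
    re.IGNORECASE | re.VERBOSE,
)


def parse_calls(text):
    """Return a list of (azimuth_degrees, distance) tuples found in OCR text."""
    calls = []
    for m in CALL_RE.finditer(text):
        angle = int(m["deg"]) + int(m["min"] or 0) / 60 + float(m["sec"] or 0) / 3600
        ns, ew = m["ns"].upper(), m["ew"].upper()
        # Convert the quadrant bearing to an azimuth measured clockwise from north.
        if ns == "N" and ew == "E":
            azimuth = angle
        elif ns == "S" and ew == "E":
            azimuth = 180 - angle
        elif ns == "S" and ew == "W":
            azimuth = 180 + angle
        else:  # N ... W
            azimuth = 360 - angle
        calls.append((azimuth, float(m["dist"])))
    return calls


def traverse(start, calls):
    """Walk the calls from a starting (x, y) point and return every vertex."""
    x, y = start
    points = [(x, y)]
    for azimuth, dist in calls:
        x += dist * math.sin(math.radians(azimuth))  # easting change
        y += dist * math.cos(math.radians(azimuth))  # northing change
        points.append((x, y))
    return points


def to_wkt_linestring(points):
    """Format vertices as a WKT LINESTRING, a text format most GIS tools can read."""
    coords = ", ".join(f"{x:.2f} {y:.2f}" for x, y in points)
    return f"LINESTRING ({coords})"


if __name__ == "__main__":
    # Sample text with a deliberately garbled degree sign ("o") in the second call.
    ocr_text = (
        "thence N 90°00'00\" E 100.00 feet; thence S 0o00'00\" E 50 ft; "
        "thence S 90°0'0\" W 100.00 feet; thence N 00° 00' 00\" W 50.00 feet"
    )
    vertices = traverse((0.0, 0.0), parse_calls(ocr_text))
    print(to_wkt_linestring(vertices))
```

Checking that the last vertex lands back on the starting point (the traverse "closes") is a simple way to flag descriptions where OCR misread a bearing or distance and a person should review the scan.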
A tremendous amount of time is currently spent manually keying coordinates from scanned documents such as PDFs into a mapping system. Improving this process saves a great deal of time, which lowers costs and shortens the time needed to build a superior Enterprise GIS.
We don’t believe staff will be “replaced” by this automation, but rather that they will be freed to work on higher-value tasks such as building, checking, and analyzing the resulting data for strategies and opportunities.
As an old saying goes, the way to eat an elephant is one bite at a time, and trapped business data in scanned documents is an absolutely enormous elephant.