This summer we were awarded a small research grant from NASA’s Technology Data and Innovation Division to investigate extracting structured information from scans of engineering documents, and we recently demoed our proof-of-concept app to NASA’s Office of the CIO. Our previous work for NASA’s Extra Vehicular Activities (EVA) Office used serverless cloud computing and optical character recognition (OCR) to extract unstructured text and make documents searchable. For this project, NASA asked us to retrieve structured tabular data from the parts lists in their technical diagrams. Because manual entry of these details is tedious, slow, and error-prone, NASA is looking for software tools that help human technicians do this work more easily, quickly, and accurately.
After surveying the literature, we identified several candidate approaches. We initially expected OCR software to solve the entire problem, but found that it could not reliably extract all the content from the tables it identified in the diagrams. In the end, we settled on a three-step approach combining best-of-breed open-source tools: (1) use computer-vision techniques to detect the horizontal and vertical lines in the drawing; (2) cluster the parallel lines to infer table rows and columns (and, by extension, cells); (3) extract the text from each cell using OCR.
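For the curious, steps (1) and (2) can be sketched in a few dozen lines of pure Python. This is a simplified illustration, not our production code: it assumes the scan has already been binarized into a 0/1 pixel grid, and the simple row/column scans stand in for the OpenCV-style morphology and line detection a real implementation would use.

```python
def find_lines(img):
    """Step 1: treat mostly-dark pixel rows as horizontal rulings and
    mostly-dark pixel columns as vertical rulings (1 = dark pixel)."""
    h, w = len(img), len(img[0])
    rows = [y for y in range(h) if sum(img[y]) >= 0.9 * w]
    cols = [x for x in range(w) if sum(img[y][x] for y in range(h)) >= 0.9 * h]
    return rows, cols

def cluster(coords, gap=2):
    """Step 2: a thick ruling spans several adjacent pixel coordinates;
    merge each run of nearby coordinates into one averaged boundary."""
    groups, run = [], [coords[0]]
    for c in coords[1:]:
        if c - run[-1] <= gap:
            run.append(c)
        else:
            groups.append(sum(run) // len(run))
            run = [c]
    groups.append(sum(run) // len(run))
    return groups

def cells(img):
    """Pair consecutive boundaries into (top, bottom, left, right) boxes,
    one per table cell. Step 3 would crop each box from the original scan
    and hand it to an OCR engine such as Tesseract."""
    rows, cols = find_lines(img)
    ys, xs = cluster(rows), cluster(cols)
    return [(t, b, l, r)
            for t, b in zip(ys, ys[1:])
            for l, r in zip(xs, xs[1:])]

# A tiny synthetic "scan": a 2x2 table drawn with 1-pixel rulings.
W = 9
grid = [[0] * W for _ in range(W)]
for i in (0, 4, 8):              # horizontal rulings
    grid[i] = [1] * W
for row in grid:                 # vertical rulings
    for j in (0, 4, 8):
        row[j] = 1

print(cells(grid))  # four cell boxes for the 2x2 table
```

On real scans the rulings are noisy and slightly skewed, which is why the dark-pixel threshold is 90% rather than 100% and why the clustering step is needed at all.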
With the server-side algorithm identified, we developed a simple, focused UI to help users feed in images of parts-list tables. First, the user selects and uploads a document (Figure 1), which our software converts to an image for display. The user then “lassos” the desired table inside this image (Figure 2). Finally, the server performs the extraction and returns a downloadable CSV that the user can view and edit in Excel, Google Sheets, etc. (Figure 3).
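The final step, serializing the extracted cell text as a CSV, is straightforward with Python’s standard library. A minimal sketch, assuming the OCR results arrive as a row-major grid of strings (the function name and sample rows here are illustrative, not our actual server code):

```python
import csv
import io

def table_to_csv(rows):
    """Serialize a grid of extracted cell strings into CSV text that
    opens cleanly in Excel or Google Sheets."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

# Illustrative parts-list rows as OCR might return them.
parts = [
    ["ITEM", "PART NAME", "QTY"],
    ["1", "Hex bolt", "4"],
    ["2", "Lock washer", "4"],
]
print(table_to_csv(parts))
```

The `csv` module handles quoting and escaping automatically, so cell text containing commas or quotes survives the round trip into spreadsheet software.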
Because our technique can extract text and row, column, and cell relationships from any tabular data source, and because we can’t post NASA’s sensitive spaceflight hardware diagrams here, we’ll substitute an engineering diagram we found on the Internet.
*Figure 1: the user-uploaded diagram*

*Figure 2: the user lassos the table*

*Figure 3: the extracted table text in a spreadsheet*
As you can see, the accuracy of the text extraction, and the preservation of row, column, and cell relationships, is outstanding even when starting from a low-resolution, low-contrast scan of a technical drawing.
We’re very happy with the results of this quick proof-of-concept and look forward to applying it to new data sets and use cases to refine it further. We have some ideas for improving the feature set, and we’re especially interested in comparing and/or combining it with AWS Textract to prepare data sets for domain-specific tabular data extraction AIs! If you’re interested in scheduling a demo or have suggestions on future directions for this work, please contact us at info@v-studios.com or leave a comment below!


