DCO-DS Boundary Activity: Data Extraction from Tables and Plots in Scanned PDF Publications Conference Poster

DCO ID 11121/1505-5451-4325-7891-CC

is Contribution to the DCO

  • NO

year of publication

  • 2014


  • Data reusability is of major importance in scientific research. We often want to reuse data from old publications dating from the 1960s or earlier. However, the data in those publications are usually not ready for direct reuse because they are not yet in machine-readable formats, and the document formats used are commonly not geared toward reusability. A particularly difficult format to reuse is the Portable Document Format (PDF), as it was never designed for this purpose. This DCO boundary activity focused on retrieving data from tables and plots in scanned PDF publications as efficiently and accurately as possible. Optical character recognition (OCR), the process of extracting machine-readable characters from input images (usually scanned documents), is the key technique for this task. A variety of open source programs have been tested for different use cases. Some issues remain to be addressed; these are listed below.

publication date

  • 2014