How to Convert Different Data Formats into Universal JSON with VUD

by Text Mining, 3 min read, December 24th, 2024

Too Long; Didn't Read

Vamstar uses tools like Apache Tika, Tesseract, and PDFPlumber to convert Word, Excel, and PDF files into a standardized JSON format (VUD), extracting key structured data like paragraphs and tables.
  1. Abstract and Introduction

  2. Domain and Task

    2.1. Data sources and complexity

    2.2. Task definition

  3. Related Work

    3.1. Text mining and NLP research overview

    3.2. Text mining and NLP in industry use

    3.3. Text mining and NLP for procurement

    3.4. Conclusion from literature review

  4. Proposed Methodology

    4.1. Domain knowledge

    4.2. Content extraction

    4.3. Lot zoning

    4.4. Lot item detection

    4.5. Lot parsing

    4.6. XML parsing, data joining, and risk indices development

  5. Experiment and Demonstration

    5.1. Component evaluation

    5.2. System demonstration

  6. Discussion

    6.1. The ‘industry’ focus of the project

    6.2. Data heterogeneity, multilingual and multi-task nature

    6.3. The dilemma of algorithmic choices

    6.4. The cost of training data

  7. Conclusion, Acknowledgements, and References

4.2. Content extraction

In this component, our goal is to convert heterogeneous data file formats (Word, Excel, PDF, etc.) into a universal, machine-accessible JSON format, which we refer to as VUD. For each file, depending on its format, we use the corresponding APIs (e.g., Apache Tika for Word and Excel files, Tesseract and PDFPlumber for PDF). However, not all APIs or data files support the extraction of formatting features (e.g., font size, colour, header level), especially if a document is not structurally tagged. Therefore, we focus on extracting only the text content and a limited range of structural information. Our VUD documents contain the following structured content:


● Pages, corresponding to the pages in the source document;


● Paragraphs, identified as consecutive blocks of text separated by at least two newline characters. E.g., in Figure 1, ‘North Macedonia-Strumica: Pharmaceutical products’ and ‘2020/S 050-119757’ in the title will be treated as separate paragraphs, while ‘Legal Basis: Directive 2014/24/EU’ is a single paragraph;


● Tables, extracted using the Camelot and Tabula libraries. Each table records its basic structure, namely column and row indices (with column and row span information) and the text content of each cell. A table that runs across multiple pages is initially recognised as several separate tables, which are then merged if they meet two simple rules: 1) no non-tabular structure lies between the two tables; 2) they have the same number of columns.
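
The paragraph rule and the two table-merging rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the VUD layout used here (an ordered `elements` list whose entries carry `type`, `pages`, and `text` or `rows`), the function names, the file name, and the sample text and table values are all assumptions made for illustration, and span information is left out for brevity.

```python
import json
import re


def split_paragraphs(page_text):
    """Split text into paragraphs at runs of two or more newline characters."""
    blocks = re.split(r"(?:\r?\n[ \t]*){2,}", page_text)
    return [b.strip() for b in blocks if b.strip()]


def column_count(table):
    """Number of columns, taken from the first row (0 for an empty table)."""
    return len(table["rows"][0]) if table["rows"] else 0


def merge_tables(elements):
    """Merge consecutive tables that have the same number of columns.

    `elements` holds the document content in reading order, across pages.
    Two tables that are adjacent in this list have no non-tabular structure
    between them (rule 1); equal column counts is rule 2.
    """
    merged = []
    for element in elements:
        previous = merged[-1] if merged else None
        if (
            element["type"] == "table"
            and previous is not None
            and previous["type"] == "table"
            and column_count(previous) == column_count(element)
        ):
            # Append the rows and remember every page the table covers.
            previous["rows"].extend(list(row) for row in element["rows"])
            previous["pages"].extend(
                p for p in element["pages"] if p not in previous["pages"]
            )
        else:
            # Copy so that the caller's input is not modified.
            copied = dict(element)
            copied["pages"] = list(element["pages"])
            if element["type"] == "table":
                copied["rows"] = [list(row) for row in element["rows"]]
            merged.append(copied)
    return merged


# Invented sample input that mimics a notice title followed by a table
# that continues on the next page.
page_1_text = (
    "North Macedonia-Strumica: Pharmaceutical products\n\n"
    "2020/S 050-119757\n\n"
    "Legal Basis:\nDirective 2014/24/EU"
)
elements = [
    {"type": "paragraph", "pages": [1], "text": text}
    for text in split_paragraphs(page_1_text)
]
elements.append(
    {"type": "table", "pages": [1], "rows": [["Lot", "Item"], ["1", "Paracetamol"]]}
)
elements.append({"type": "table", "pages": [2], "rows": [["2", "Ibuprofen"]]})

vud = {"source": "notice.pdf", "elements": merge_tables(elements)}
print(json.dumps(vud, indent=2))
```

Keeping paragraphs and tables in one ordered list makes rule 1 a simple adjacency check. A layout that stored paragraphs and tables separately per page would need extra position information to decide whether anything lies between two tables.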


Authors:

(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Ziqi.Zhang@sheffield.ac.uk);

(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);

(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);

(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);

(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK S1 4DP (Adam.Funk@sheffield.ac.uk).


This paper is available on arXiv under the CC BY 4.0 license.