The text content extracted above may or may not contain relevant information for lots. This component therefore filters the text content to identify the content areas that are relevant for further processing (i.e., ‘zoning’ to the right content). We focus on two tasks: 1) selecting the relevant pages; 2) selecting the relevant tables. We define ‘relevance’ as containing a description of a lot (lot reference) and/or its items (lot items), and we cast both tasks as text classification. For example, in Figure 2, a valid ‘lot reference’ is ‘Lot Number 1’, and it contains two items (rows 2 and 3); in Figure 6, the first bullet point in Section 2.1 has a lot reference ‘Lot 1’, with the following text being its only item.
4.3.1. Page selection
Given a page 𝑝, we define 𝑠 ∈ 𝑝 as a sentence from 𝑝, and 𝐵𝑂𝑊 as a function that can be applied to 𝑠 or 𝑝 to convert it into a simple bag-of-words (BOW) representation by applying stopword removal and lowercasing (lightweight NLP that is language-dependent). We then create several language-independent features for learning a relevant-page classifier, so that it can be used for content in any language.
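As an illustration only, the 𝐵𝑂𝑊 function could be sketched as below. The function name `bow`, the regular-expression tokeniser and the tiny English stopword list are assumptions made for this example and are not taken from the system described here; a real deployment would load a full stopword list per language.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "or", "for", "in", "to"}


def bow(text, stopwords=STOPWORDS):
    """Return a bag-of-words Counter for a sentence or a whole page.

    Lowercases the text, splits it into word tokens and drops stopwords.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


page = "Lot Number 1: supply of 500 tablets. Delivery of the tablets in 30 days."
print(bow(page))
```

The same function serves for both 𝑠 and 𝑝, since a page is treated here as nothing more than a longer string.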
For each word 𝑤 ∈ 𝐿 in the domain specificity lexicon 𝐿, we search for the word in 𝐵𝑂𝑊(𝑝) and count its frequency as 𝑐𝑜𝑢𝑛𝑡(𝑤, 𝑝). Then, for each of the four metrics introduced above, we create four features (16 features in total): the sum, average, minimum and maximum of the scores, as follows, where ‘metric’ generalises the four metrics explained before (ntf, ndf, ntf-ndf, weirdness). Further, for each of the two domain dictionaries (form and measure), we create four features (8 features in total) using the same equations below, replacing 𝐿 with a dictionary 𝐷 and always setting 𝑚𝑒𝑡𝑟𝑖𝑐(𝑤) = 1. Together, this gives us a total of 24 ‘page-level’ features.
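The equations are not reproduced in this excerpt, so the sketch below shows only one plausible reading of the page-level features. It assumes that the score of a matched word is 𝑚𝑒𝑡𝑟𝑖𝑐(𝑤) × 𝑐𝑜𝑢𝑛𝑡(𝑤, 𝑝) and that the statistics are taken over matched words only; both choices, along with all function names and example values, are assumptions for illustration rather than the authors' definitions.

```python
from collections import Counter


def page_level_features(page_bow, lexicon_scores):
    """Sum, average, minimum and maximum of weighted scores for one page.

    page_bow: Counter mapping word -> count(w, p)
    lexicon_scores: dict mapping word -> metric(w)
    Assumption: score(w, p) = metric(w) * count(w, p), matched words only.
    """
    scores = [
        lexicon_scores[w] * page_bow[w]
        for w in lexicon_scores
        if page_bow.get(w, 0) > 0
    ]
    if not scores:
        return {"sum": 0.0, "avg": 0.0, "min": 0.0, "max": 0.0}
    return {
        "sum": sum(scores),
        "avg": sum(scores) / len(scores),
        "min": min(scores),
        "max": max(scores),
    }


def all_page_features(page_bow, metric_lexicons, dictionaries):
    """Build 4 features per metric lexicon and 4 per dictionary."""
    features = {}
    for name, scores in metric_lexicons.items():
        for stat, value in page_level_features(page_bow, scores).items():
            features[f"{name}_{stat}"] = value
    for name, words in dictionaries.items():
        # For dictionaries, metric(w) is always 1.
        unit_scores = {w: 1.0 for w in words}
        for stat, value in page_level_features(page_bow, unit_scores).items():
            features[f"{name}_{stat}"] = value
    return features


# Hypothetical example values.
page_bow = Counter({"tablet": 2, "lot": 1, "delivery": 3})
lexicon = {"tablet": 0.5, "lot": 2.0, "syringe": 1.0}
print(page_level_features(page_bow, lexicon))
```

With four metric lexicons (ntf, ndf, ntf-ndf, weirdness) and two dictionaries (form, measure), `all_page_features` returns 4 × 4 + 2 × 4 = 24 values, matching the count given above.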
Finally, we calculate the percentage of sentences that contain at least one number (e.g., indicating a volume, quantity or measurement, although this can sometimes be noise), and the percentage of sentences that contain at least one matched word from the lexicon or dictionaries. These are calculated as follows (where 𝐿 indicates the domain specificity lexicon but can be replaced by either of the two dictionaries):
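The two percentages can be sketched as below, assuming that a ‘number’ means any digit in the sentence and that matching is done on lowercased word tokens. The function names, the tokeniser and the example sentences are hypothetical, and the values are returned as fractions in [0, 1] rather than percentages.

```python
import re


def tokenize(sentence):
    """Lowercased set of word tokens in a sentence (assumed tokeniser)."""
    return set(re.findall(r"\w+", sentence.lower()))


def pct_with_number(sentences):
    """Fraction of sentences that contain at least one digit."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if re.search(r"\d", s))
    return hits / len(sentences)


def pct_with_match(sentences, vocabulary):
    """Fraction of sentences containing at least one word from vocabulary.

    vocabulary can be the domain specificity lexicon L or a dictionary D.
    """
    if not sentences:
        return 0.0
    vocab = {w.lower() for w in vocabulary}
    hits = sum(1 for s in sentences if tokenize(s) & vocab)
    return hits / len(sentences)


# Hypothetical example page split into sentences.
sentences = [
    "Lot 1 covers surgical gloves.",
    "Delivery within 30 days.",
    "Tenders must be submitted in English.",
    "Each box contains 100 tablets.",
]
print(pct_with_number(sentences))
print(pct_with_match(sentences, {"gloves", "tablets"}))
```

Calling `pct_with_match` once with the lexicon and once with each of the two dictionaries, plus `pct_with_number`, yields four of the sentence-based features.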
In total, we obtain 52 features for a page. Our feature extraction process is arguably language-independent, as it depends only on counting word frequencies using lexicons/dictionaries and on pre-computed scores assigned to the entries in these lexicons/dictionaries. The language-dependent processes are the translation of the domain lexicons/dictionaries, and the lightweight NLP processes, for which models for different languages are more readily available. For businesses, this helps create more affordable multilingual NLP solutions than alternatives such as translating input documents or using more language-dependent models. We will discuss this further in later sections.
In order to learn a classifier that determines if a page is relevant or irrelevant, we asked domain experts to annotate a random collection of pages extracted from the raw TED datasets. For the machine learning algorithms, we compare linear SVM, logistic regression, and Random Forest.
4.3.2. Table selection
Similar to page selection, the goal here is to filter out tables that are irrelevant. Irrelevant tables are common because tables are used in tender notices to describe not only the lots, but often also award criteria and product specifications, or simply for formatting purposes. In this work, we assume that all tables are ‘horizontal’, that is, their columns define attributes of data instances and their rows contain individual data instances. We define 𝑡 as a table, and 𝑟 ∈ 𝑡 and 𝑐 ∈ 𝑡 as the rows and columns of the table, respectively. If we consider the content of a table as a structure-less page, and concatenate each of its rows or columns into a sequence of words equivalent to a sentence potentially describing an instance, then we can apply the same feature extraction methods explained above. Where a cell spans several rows or columns, we duplicate its value across all the rows/columns covered by the span. We therefore obtain a total of 80 features, as follows:
● 24 ‘page’ level features if we treat the table as a simple flat textual page
● 28 ‘sentence’ level features if we treat the table as a simple flat textual page, and its rows as sentences
● 28 ‘sentence’ level features if we treat the table as a simple flat textual page, and its columns as sentences
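The three views of a table listed above can be sketched as follows. The input representation (each cell as a `(text, rowspan, colspan)` tuple), the function names and the example table are assumptions for illustration; the excerpt does not specify how tables are represented internally.

```python
def expand_spans(cells):
    """Expand a table into a rectangular grid, duplicating spanned values.

    cells: list of rows; each cell is a (text, rowspan, colspan) tuple.
    """
    grid = {}
    for r, row in enumerate(cells):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already filled by a row span from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def row_sentences(grid):
    """Each row concatenated into one pseudo-sentence."""
    return [" ".join(row) for row in grid]


def column_sentences(grid):
    """Each column concatenated into one pseudo-sentence."""
    return [" ".join(col) for col in zip(*grid)]


def flat_page(grid):
    """The whole table as one flat, structure-less text page."""
    return " ".join(row_sentences(grid))


# Hypothetical table: 'Lot 1' spans two rows.
table = [
    [("Lot", 1, 1), ("Item", 1, 1), ("Qty", 1, 1)],
    [("Lot 1", 2, 1), ("Gloves", 1, 1), ("100", 1, 1)],
    [("Masks", 1, 1), ("200", 1, 1)],
]
grid = expand_spans(table)
print(row_sentences(grid))
print(column_sentences(grid))
```

The flat page would feed the 24 page-level features, while the row and column pseudo-sentences would each feed one set of 28 sentence-level features, giving 24 + 28 + 28 = 80.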
For training, our domain experts annotated a collection of tables extracted from the raw TED dataset (to be detailed later) and we compared the same group of machine learning algorithms.
Authors:
(1) Ziqi Zhang*, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Ziqi.Zhang@sheffield.ac.uk);
(2) Tomas Jasaitis, Vamstar Ltd., London (Tomas.Jasaitis@vamstar.io);
(3) Richard Freeman, Vamstar Ltd., London (Richard.Freeman@vamstar.io);
(4) Rowida Alfrjani, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Rowida.Alfrjani@sheffield.ac.uk);
(5) Adam Funk, Information School, the University of Sheffield, Regent Court, Sheffield, UK, S1 4DP (Adam.Funk@sheffield.ac.uk).
This paper is