Problem layout
Historically, organizations have had different departments (e.g. finance, technical operations) with different tasks. These different departments and their business procedures all have evolved over time. These procedures tackle different requirements of their departments, resulting in different views on the same objects. This makes merging databases, for example in the pursuit of a homogeneous and unified asset landscape rather difficult. To account for this, free-text fields are often included.
These free-text fields are usually added manually by humans to provide additional context beyond the simple technical data inside fixed columns. This data cannot be expected to have been considered during the initial creation of the database. Instead, language which has evolved in the organization over time is likely to be found here. Examples would be designations whether the object is an asset or a sub-asset. While not applicable on a global scale, these entries usually carry some local information. This can be leveraged to assist in the merging of the databases. As an example, the assignment of entries of two datasets from the first to the second will be illustrated here, based on a use case encountered. A similar matching approach has also been applied by the Deutsche Bundesbank.
Datasets
For the purpose of this blog, consider two datasets. Dataset A will be designated ‘source’ and contains a list of assets from one business view. These entries include technical categorical data, no explicit geoinformation, but also free-text.
Example table
Source Table
Name | Example values |
Free-text field 1 | Main Street Fixed Installation 1 Main Door AB-XYZ-123 |
Free-text field 2 | Sample Town 1234 Main repair station AB-XYZ-123 None |
Building type | 4567 |
Asset class | 3000 |
Department | ABC |
Asset-ID | 79012345 None |
The second dataset B will be the “target”. It contains a larger number of entries from a second business perspective. For the target dataset, the data is considered to be more homogeneous. It will be used as a future reference database. Apart from technical strings which can be related to the source, geo-information, such as postcode, town and street (number) are available. All entries have a UID which needs to be identified for each ‘source’ entry. A number of entries in the source have already been assigned a UID from the target dataset. They are candidates for a training set. These entries can be the result of establishing the second database sometime during the lifetime of the first or of initial manual assignments. The objective is to find a method which can assign the missing UIDs to the other source entries.
Target Table
Name | Example values |
Building type | 4567 |
Asset class | 3000 |
Post Code | 1234 |
Location | Sample Town |
Street | Main Street, Side Street Block B |
Department | ABC |
Asset-ID | 79012345 |
Challenges in sub-selection
A well-designed approach to the feature engineering can lay the foundation for a machine-learning-based model with a very good performance. Our approach leverages the human-centred information stored in the free-text fields. This free-text carries additional information known in the company, but not accounted for in the creation of the database. To assign entries from the source dataset to the target dataset, a similarity approach is used. While the dataset does represent different views on the assets in question, some information is assumed to be identical to be able to relate these together.
The features used can be generally categorized into the following types:
- categorical strings (e.g. building codes ‘ABAC’)
- categorical integer strings (e.g. Postcodes, phone numbers)
- semi-standardized technical strings (e.g. City names)
- free-text strings (‘Construction site from 2019’)
The primary use of the categorical data may allow for narrowing down the number of possible candidates. However, it may not always be possible to identify a single candidate. Instead, multiple entries might be related. This is where additional information will help.
Especially in multinational/multilingual companies, free-text fields require human understanding. Using direct word-comparison can run into issues of different languages, different styles and degrees of abbreviations etc. This makes accessing the information using simple machine learning or fuzzy matching less efficient and prone to errors. Further, the free-texts are usually relatively short, consisting only of up to a dozen words. Applying fuzzy matching on technical strings will be even more error-prone (consider ‘ABCD’ vs ‘ABAD’). There, Boolean comparison are the more robust and defensive approach, avoiding false assignments.
Feature generation
The first step is to extract as much information as a possible from the free-text fields in the source. This can be performed using regular expressions. Further, a decomposition of composed of technical strings is recommended.
Extraction of the location information can be easily achieved by looking for the characteristic postcode pattern and verifying them against the town names present in the target dataset. Further location information can be extracted by looking for strings which contained different variations of words which designate a street (e.g. “str.”, “str”, “weg”). While this might not be able to extract all streets, the aim is to have a high confidence that the strings and addresses obtained are in fact street addresses. If a street is found, additional filtering close to the street text can be used to identify street numbers.
While the use of regular expressions allows the creation of sparsely populated columns, additional information might still be present in the free-text. This motivates the use of fuzzy-matching logic. However, using it directly on the free-text will not appreciate the truly unstructured free-text. As such, after each extraction, the extracted information should be removed from the free-text field which shortens the free-text string and removes structured text. Such a procedure will also improve the fuzzy matching.
This approach results in several features for which similarity features can be calculated. Depending on the nature of the feature, simple Boolean comparison (e.g. City, Postcode or Department-ID), possible difference comparison (e.g. elevation, performance metrics etc.) and finally fuzzy-matching is possible. The last case can be applied on several possible features, e.g. fuzzy-matching the free-text against free-text but as well for example free-text against categorical strings (in case a categorical string was overlooked or misspelled).
Selecting the right candidate
To reduce the computational load, some pre-filtering is recommended before the calculation of the similarity measure. For example, this can be based on a location category (e.g. region). This is generally a very robust indicator.
These similarity features are then calculated for each source-entry target-candidate pair, which gives assessments on the similarity. This is an indicator for the likeliness for being a true source-target pair. The question remains how relevant each of these similarity measures is for deciding whether a suggested assignment is correct. To decide this, a random forest classifier (RFC) is the natural choice of method. In our use case, the size of the RFC (number of estimators, maximum depth) could be reduced, particularly the number of estimators. The maximum depth did affect the performance in a stronger fashion. This can be understood to affect edge cases, as the depth of the tree might not be enough to consider specific circumstances.
The main concern is to avoid false positives, i.e. erroneously deciding that a source-candidate pair is a matching pair. If there are cases where there is a false negative, this source will have (assuming that no false positive has occurred) no partner. Such cases can be assigned manually later on, after the bulk has been automatically assigned. Assessment of the importance of the features should show a high importance of the categorical data, reflecting the underlying structure of the data. The fuzzy-logic will be of lesser overall importance, but useful for ambiguous cases.
Conclusion
This use case illustrates the way human-based text-entries stored in free-text fields can be leveraged to establish connections between hitherto unconnected data. Instead of relying on simple columns names, informed features can improve the identification of connected datasets. Conversely, less time and work has to be spent on manual assignment of data.