Pro­blem layout

His­to­ri­cally, orga­niza­ti­ons have had dif­fe­rent depart­ments (e.g. finance, tech­ni­cal ope­ra­ti­ons) with dif­fe­rent tasks. These dif­fe­rent depart­ments and their busi­ness pro­ce­du­res all have evol­ved over time. These pro­ce­du­res tackle dif­fe­rent requi­re­ments of their depart­ments, resul­ting in dif­fe­rent views on the same objects. This makes mer­ging data­ba­ses, for exam­ple in the pur­suit of a homo­ge­neous and uni­fied asset land­scape rather dif­fi­cult. To account for this, free-text fields are often included.

These free-text fields are usually added manu­ally by humans to pro­vide addi­tio­nal con­text bey­ond the simple tech­ni­cal data inside fixed columns. This data can­not be expec­ted to have been con­side­red during the initial crea­tion of the data­base. Ins­tead, lan­guage which has evol­ved in the orga­niza­tion over time is likely to be found here. Examp­les would be desi­gna­ti­ons whe­ther the object is an asset or a sub-asset. While not appli­ca­ble on a glo­bal scale, these ent­ries usually carry some local infor­ma­tion. This can be lever­a­ged to assist in the mer­ging of the data­ba­ses. As an exam­ple, the assign­ment of ent­ries of two data­sets from the first to the second will be illus­tra­ted here, based on a use case encoun­te­red. A simi­lar matching approach has also been applied by the Deut­sche Bun­des­bank.

Data­sets

For the pur­pose of this blog, con­sider two data­sets. Data­set A will be desi­gna­ted ‘source’ and con­ta­ins a list of assets from one busi­ness view. These ent­ries include tech­ni­cal cate­go­ri­cal data, no expli­cit geo­in­for­ma­tion, but also free-text.

Exam­ple table

Source Table
NameExam­ple values
Free-text field 1Main Street
Fixed Instal­la­tion 1
Main Door
AB-XYZ-123
Free-text field 2Sam­ple Town 1234
Main repair sta­tion
AB-XYZ-123
None
Buil­ding type4567
Asset class3000
Depart­mentABC
Asset-ID79012345
None

The second data­set B will be the “tar­get”. It con­ta­ins a lar­ger num­ber of ent­ries from a second busi­ness per­spec­tive. For the tar­get data­set, the data is con­side­red to be more homo­ge­neous. It will be used as a future refe­rence data­base. Apart from tech­ni­cal strings which can be rela­ted to the source, geo-infor­ma­tion, such as post­code, town and street (num­ber) are available. All ent­ries have a UID which needs to be iden­ti­fied for each ‘source’ entry. A num­ber of ent­ries in the source have alre­ady been assi­gned a UID from the tar­get data­set. They are can­di­da­tes for a trai­ning set. These ent­ries can be the result of estab­li­shing the second data­base some­time during the life­time of the first or of initial manual assign­ments. The objec­tive is to find a method which can assign the miss­ing UIDs to the other source entries.

Tar­get Table
NameExam­ple values
Buil­ding type4567
Asset class3000
Post Code1234
Loca­tionSam­ple Town
StreetMain Street, Side Street Block B
Depart­mentABC
Asset-ID79012345

Chal­lenges in sub-selection

A well-desi­gned approach to the fea­ture engi­nee­ring can lay the foun­da­tion for a machine-lear­ning-based model with a very good per­for­mance. Our approach lever­a­ges the human-cent­red infor­ma­tion stored in the free-text fields. This free-text car­ries addi­tio­nal infor­ma­tion known in the com­pany, but not accoun­ted for in the crea­tion of the data­base. To assign ent­ries from the source data­set to the tar­get data­set, a simi­la­rity approach is used. While the data­set does repre­sent dif­fe­rent views on the assets in ques­tion, some infor­ma­tion is assu­med to be iden­ti­cal to be able to relate these together.

The fea­tures used can be gene­rally cate­go­ri­zed into the fol­lo­wing types:

  • cate­go­ri­cal strings (e.g. buil­ding codes ‘ABAC’)
  • cate­go­ri­cal inte­ger strings (e.g. Post­codes, phone numbers)
  • semi-stan­dar­di­zed tech­ni­cal strings (e.g. City names)
  • free-text strings (‘Con­s­truc­tion site from 2019’)

The pri­mary use of the cate­go­ri­cal data may allow for nar­ro­wing down the num­ber of pos­si­ble can­di­da­tes. Howe­ver, it may not always be pos­si­ble to iden­tify a sin­gle can­di­date. Ins­tead, mul­ti­ple ent­ries might be rela­ted. This is where addi­tio­nal infor­ma­tion will help.

Espe­ci­ally in multinational/multilingual com­pa­nies, free-text fields require human under­stan­ding. Using direct word-com­pa­ri­son can run into issues of dif­fe­rent lan­guages, dif­fe­rent styles and degrees of abbre­via­ti­ons etc. This makes acces­sing the infor­ma­tion using simple machine lear­ning or fuzzy matching less effi­ci­ent and prone to errors. Fur­ther, the free-texts are usually rela­tively short, con­sis­ting only of up to a dozen words. App­ly­ing fuzzy matching on tech­ni­cal strings will be even more error-prone (con­sider ‘ABCD’ vs ‘ABAD’). There, Boolean com­pa­ri­son are the more robust and defen­sive approach, avo­i­ding false assignments.

Fea­ture generation

The first step is to extract as much infor­ma­tion as a pos­si­ble from the free-text fields in the source. This can be per­for­med using regu­lar expres­si­ons. Fur­ther, a decom­po­si­tion of com­po­sed of tech­ni­cal strings is recommended.

Extra­c­tion of the loca­tion infor­ma­tion can be easily achie­ved by loo­king for the cha­rac­te­ristic post­code pat­tern and veri­fy­ing them against the town names pre­sent in the tar­get data­set. Fur­ther loca­tion infor­ma­tion can be extra­c­ted by loo­king for strings which con­tai­ned dif­fe­rent varia­ti­ons of words which desi­gnate a street (e.g. “str.”, “str”, “weg”). While this might not be able to extract all streets, the aim is to have a high con­fi­dence that the strings and addres­ses obtai­ned are in fact street addres­ses. If a street is found, addi­tio­nal fil­te­ring close to the street text can be used to iden­tify street numbers.

Connecting assets across different datasets using free-text fields Bild3
Figure 1 Extra­c­tion pro­cess to gene­rate new features.

While the use of regu­lar expres­si­ons allows the crea­tion of spar­sely popu­la­ted columns, addi­tio­nal infor­ma­tion might still be pre­sent in the free-text. This moti­va­tes the use of fuzzy-matching logic. Howe­ver, using it directly on the free-text will not app­re­ciate the truly unstruc­tu­red free-text. As such, after each extra­c­tion, the extra­c­ted infor­ma­tion should be remo­ved from the free-text field which shor­tens the free-text string and remo­ves struc­tu­red text. Such a pro­ce­dure will also improve the fuzzy matching.

This approach results in seve­ral fea­tures for which simi­la­rity fea­tures can be cal­cu­la­ted. Depen­ding on the nature of the fea­ture, simple Boolean com­pa­ri­son (e.g. City, Post­code or Depart­ment-ID), pos­si­ble dif­fe­rence com­pa­ri­son (e.g. ele­va­tion, per­for­mance metrics etc.) and finally fuzzy-matching is pos­si­ble. The last case can be applied on seve­ral pos­si­ble fea­tures, e.g. fuzzy-matching the free-text against free-text but as well for exam­ple free-text against cate­go­ri­cal strings (in case a cate­go­ri­cal string was over­loo­ked or misspelled).

Sel­ec­ting the right candidate

To reduce the com­pu­ta­tio­nal load, some pre-fil­te­ring is recom­men­ded before the cal­cu­la­tion of the simi­la­rity mea­sure. For exam­ple, this can be based on a loca­tion cate­gory (e.g. region). This is gene­rally a very robust indicator.

Connecting assets across different datasets using free-text fields Bild4
Figure 2 Pro­cess for iden­ti­fi­ca­tion of sui­ta­ble candidates

These simi­la­rity fea­tures are then cal­cu­la­ted for each source-entry tar­get-can­di­date pair, which gives assess­ments on the simi­la­rity. This is an indi­ca­tor for the likeli­ness for being a true source-tar­get pair. The ques­tion remains how rele­vant each of these simi­la­rity mea­su­res is for deci­ding whe­ther a sug­gested assign­ment is cor­rect. To decide this, a ran­dom forest clas­si­fier (RFC) is the natu­ral choice of method. In our use case, the size of the RFC (num­ber of esti­ma­tors, maxi­mum depth) could be redu­ced, par­ti­cu­larly the num­ber of esti­ma­tors. The maxi­mum depth did affect the per­for­mance in a stron­ger fashion. This can be unders­tood to affect edge cases, as the depth of the tree might not be enough to con­sider spe­ci­fic circumstances.

The main con­cern is to avoid false posi­ti­ves, i.e. erro­n­eously deci­ding that a source-can­di­date pair is a matching pair. If there are cases where there is a false nega­tive, this source will have (assum­ing that no false posi­tive has occur­red) no part­ner. Such cases can be assi­gned manu­ally later on, after the bulk has been auto­ma­ti­cally assi­gned. Assess­ment of the importance of the fea­tures should show a high importance of the cate­go­ri­cal data, reflec­ting the under­ly­ing struc­ture of the data. The fuzzy-logic will be of les­ser over­all importance, but useful for ambi­guous cases.

Con­clu­sion

This use case illus­tra­tes the way human-based text-ent­ries stored in free-text fields can be lever­a­ged to estab­lish con­nec­tions bet­ween hitherto uncon­nec­ted data. Ins­tead of rely­ing on simple columns names, infor­med fea­tures can improve the iden­ti­fi­ca­tion of con­nec­ted data­sets. Con­ver­sely, less time and work has to be spent on manual assign­ment of data.