Overview¶

What’s Record Linkage¶

Record linkage (RL) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). – Wikipedia

Assume we have two following tables:

It’s obvious that both of them have identical id A04 and these two belong to the same entity. Then we know that “David’s id is A04 and he is a male”.

But real world situations are more complex:

Typos: “Joh” vs “John”
OCR errors: “J0hn” vs “John”
Formatting conventions: “03/17” vs “March 17”
Abbreviations: “J. K. Rowling” vs “Joanne Rowling”
Nick names: “John” vs “Jock”
Word order: “Sargent, John S.” vs “John S. Sargent”

Record Linkage Toolkit (RLTK) is designed to give a easy to use, scalable and extensible way of resolving these problems.

Basic Components & Data Flow¶

The first step of using RLTK is to create Dataset.

In RLTK, every “row” in table is called Record. Notice “row” here is a logical concept: for a csv file it’s a csv object (can be formed by multiple lines); for a json lines file, it’s a line object; for a query from DBMS, it’s one row; for some self-defined stream, it’s a minimal entity.

Dataset is a “table”, or more precise, a collection of Record s.

Reader is used to handle heterogeneous input. For each “row”, the return of the Reader is represented in a Python dictionary, named raw_object (shown as grey {...} in figure). Record is generated based on raw_object: user need to extend base Record class and add properties for later usage.

KeyValueAdapter is where all Record s get stored. It can be either in memory or persistent.

So, the data flow is: in order to create Dataset, use Reader to read from input source and convert entity by entity to raw_object which is used to construct Record, then store Record in KeyValueAdapter.

Obviously, generating Record is really time consuming if the dataset is large, but if the concrete KeyValueAdapter is a persistent one (e.g., HBaseKeyValueAdapter), then next time, Dataset can be loaded directly from this KeyValueAdapter instead of regenerating again from raw input.

Minimal Workflow & Implementation¶

Now we have two Datasets and we need to find pairs (If it’s de-duplication problem, only one Dataset is needed).

Simply using RLTK’s API to get all possible combinations of candidate pairs and implement your only “magical function” (shown as is_pair() in figure) to find if two Record s are the same.

Let’s look at example input datasets and minimal implementation.

import rltk

class Record1(rltk.Record):
   @property
   def id(self):
      return self.raw_object['doc_id']

   @property
   def value(self):
      return self.raw_object['doc_value']

class Record2(rltk.Record):
   @rltk.cached_property
   def id(self):
      return self.raw_object['ident']

   @rltk.cached_property
   def value(self):
      v = self.raw_object.get('values', list())
      return v[0] if len(v) > 0 else 'empty'


ds1 = rltk.Dataset(reader=rltk.CSVReader('ds1.csv'),
                     record_class=Record1, adapter=rltk.MemoryKeyValueAdapter())
ds2 = rltk.Dataset(reader=rltk.JsonLinesReader('ds2.jl'),
                     record_class=Record2, adapter=rltk.DbmKeyValueAdapter('file_index'))

pairs = rltk.get_record_pairs(ds1, ds2)
for r1, r2 in pairs:
   print('-------------')
   print(r1.id, r1.value, '\t', r2.id, r2.value)
   print('levenshtein_distance:', rltk.levenshtein_distance(r1.value, r2.value))
   print('levenshtein_similarity:', rltk.levenshtein_similarity(r1.value, r2.value))

One thing to notice here: the property in Record class can be decorated by @property, or @rltk.cached_property which pre-calculates the value instead of computing at the runtime.

For the “magical function”, you can use any methods that make sense: hand-crafted rules, machine learning model, etc. RLTK provides a lot of similarity metrics which can be very helpful while doing comparison.

Evaluation¶

After designing the “magical function”, you need a way to judge its performance. RLTK has a built-in package called Evaluation which includes three basic components:

Groud Truth: Ground truth data.
Trial: Store the result of prediction of candidate pairs.
Evaluation: Visualize the result of evaluation if multiple trials are given.

_images/overview-evaluation-workflow.png

As can be seen from the figure, every Trial has a corresponding GroundTruth. GroundTruth needs to be provided while generating candidate pairs. Add prediction result to Trial if it needs to be evaluated later. Call evaluate() to get the evaluation of the Trial against GroundTruth.

gt = rltk.GroundTruth()
gt.load('gt.csv')
eva = rltk.Evaluation()
trial = rltk.Trial(ground_truth=gt)

test_pairs = rltk.get_record_pairs(ds1, ds2, ground_truth=gt):
for r1, r2 in test_pairs:
   is_positive = is_pair(r1, r2)
   trial.add_result(r1, r2, is_positive)

trial.evaluate()
print(trial.true_positives, trial.false_positives, trial.true_negatives, trial.false_negatives,
       trial.precision, trial.recall, trial.f_measure)

Notice add_positive() and add_negative() are just syntactic sugar of add_result() used in above code snippet.

Blocking¶

When finding pairs between two datasets, how many total comparison does it make?

Let’s say the 1st dataset has M items and and 2nd has N, then it needs M*N comparisons. If M=10,000, N=100,000, M*N=1,000,000,000. If the computer can determine a heavy is_pair() in 0.001s, in total it costs 1 billion x 0.001s / 60 / 60 / 24 = 11.57 days. Apparently exhausting is not a good choice. Blocking is something invented to tackle this problem. Blocking attempts to restrict comparisons to just those records for which one or more particularly discriminating identifiers agree, which has the effect of increasing the positive predictive value (precision) at the expense of sensitivity (recall).

For example: Full comparison (cross product) of two tables (shown in figure) is 12 times. After inspection, it’s obvious to say that “last name” can be used as blocking key (group by based on key) since people who have different last name can’t be the same. Then, total comparison drops to 3 times.

Blocks need to be calculated and passed while generating candidate pairs. Blocks’ calculation can be time consuming so RLTK supports dumping them to disk for further usage.

def get_first_name(r2):
   return r2.full_name.split(' ')[0]

bg = rltk.HashBlockGenerator()
block = bg.generate(
               bg.block(ds1, property_='first_name'),
               bg.block(ds2, function_=get_first_name)),
pairs = rltk.get_record_pairs(ds1, ds2, block=block)
for r1, r2 in pairs:
    print(r1.id, r1.full_name, '\t', r2.id, r2.full_name)

Summary¶

Now you should know what’s the goal of record linkage, how to construct Dataset and how to use it in RLTK workflow, how to evaluate the quality of linkage and how to use blocking technique to deal with large datasets.