Step-by-step example

This tutorial walks through one example end to end. The main steps are:

  • Dataset analysis
  • Construct RLTK datasets
  • Blocking
  • Pairwise comparison
  • Evaluation

Dataset analysis

The dataset used here is an artificial dataset constructed from DBLP and Scholar data. Let’s take a look.

[1]:
# initialization
import os
from datetime import datetime
import pandas as pd
from IPython.display import display
import rltk
[2]:
df_dblp = pd.read_csv('resources/dblp.csv', parse_dates=False)
df_dblp.head()
[2]:
id names date
0 journals/sigmod/HummerLW02 W Hümmer, W Lehner, H Wedekind 2018-12-24
1 conf/vldb/AgrawalS94 R Agrawal, R Srikant 2018-12-22
2 conf/vldb/Brin95 S Brin 2018-12-26
3 conf/vldb/ChakravarthyKAK94 S Chakravarthy, V Krishnaprasad, E Anwar, S Kim 2018-12-29
4 conf/vldb/MedianoCD94 M Mediano, M Casanova, M Dreux 2018-12-26
[3]:
df_scholar = pd.read_json('resources/scholar.jl', lines=True, convert_dates=False)
df_scholar.head()
[3]:
date id names
0 26, Dec 2018 ek26aiEheesJ M Fernandez, J Kang, A Levy, D Suciu
1 29, Dec 2018 rmtEGXAXHKIJ S Adali, KS Candan, Y Papakonstantinou, VS
2 27, Dec 2018 D0z0BDnbnFcJ S Christodoulakis
3 28, Dec 2018 noTo81QxmHQJ ACMS Anthology, P Edition
4 28, Dec 2018 l0W27c1C3NwJ W Litwin, MA Neimat, DA Schneider

A quick glance shows that both datasets have id, date and names columns. Two things stand out:

  • The dates use different formats.
  • Each names column holds several names separated by commas.
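
These two observations suggest one normalization step per column. As a minimal sketch using only the Python standard library (the helper names `normalize_date` and `split_names` are illustrative and not part of RLTK), a Scholar-style date can be parsed with its explicit format and re-emitted in the DBLP style, and a names string can be split on commas:

```python
from datetime import datetime

def normalize_date(text, fmt='%d, %b %Y'):
    """Parse a Scholar-style date such as '26, Dec 2018' and return 'YYYY-MM-DD'.

    Note: '%b' matches abbreviated month names in the current locale,
    so this assumes an English (or C) locale.
    """
    return datetime.strptime(text, fmt).strftime('%Y-%m-%d')

def split_names(text):
    """Split a comma-separated names string into a list of trimmed names."""
    return [name.strip() for name in text.split(',')]

print(normalize_date('26, Dec 2018'))
print(split_names('M Fernandez, J Kang, A Levy, D Suciu'))
```

After this step, the dates from both sources are directly comparable as strings, and each name can be compared on its own.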

Construct RLTK datasets

In RLTK, a data collection is called a Dataset and each “row” is a Record instance. To construct a Dataset, you read data from the source with a specific Reader. Each row is then presented as a Python dict, raw_object, which is used to construct a Record instance according to the schema (a concrete subclass of Record) you defined.

For DBLP:

[4]:
class DBLP(rltk.Record):
    @property
    def id(self):
        return self.raw_object['id']

    @property
    def date(self):
        return self.raw_object['date']

    @property
    def names(self):
        return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))
[5]:
ds_dblp = rltk.Dataset(rltk.CSVReader('resources/dblp.csv'), record_class=DBLP)

for r_dblp in ds_dblp.head():
    print(r_dblp.id, r_dblp.date, r_dblp.names)
journals/sigmod/HummerLW02 2018-12-24 ['W Hümmer', 'W Lehner', 'H Wedekind']
conf/vldb/AgrawalS94 2018-12-22 ['R Agrawal', 'R Srikant']
conf/vldb/Brin95 2018-12-26 ['S Brin']
conf/vldb/ChakravarthyKAK94 2018-12-29 ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim']
conf/vldb/MedianoCD94 2018-12-26 ['M Mediano', 'M Casanova', 'M Dreux']
conf/vldb/SistlaYH94 2018-12-25 ['A Sistla', 'C Yu', 'R Haddad']
journals/sigmod/PourabbasR00 2018-12-20 ['E Pourabbas', 'M Rafanelli']
conf/sigmod/MelnikRB03 2018-12-21 ['S Melnik', 'E Rahm', 'P Bernstein']
conf/sigmod/ZhangDWEMPMDR03 2018-12-26 ['X Zhang', 'K Dimitrova', 'L Wang', 'M El-Sayed', 'B Murphy', 'B Pielech', 'M Mulchandani', 'L Ding', 'E Rundensteiner']
conf/sigmod/ZhouWGGZWXYF03 2018-12-27 ['A Zhou', 'Q Wang', 'Z Guo', 'X Gong', 'S Zheng', 'H Wu', 'J Xiao', 'K Yue', 'W Fan']

For Scholar:

[6]:
@rltk.remove_raw_object
class Scholar(rltk.Record):
    @rltk.cached_property
    def id(self):
        return self.raw_object['id']

    @rltk.cached_property
    def date(self):
        return datetime.strptime(self.raw_object['date'], '%d, %b %Y').date().strftime('%Y-%m-%d')

    @rltk.cached_property
    def names(self):
        return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))
[7]:
ds_scholar = rltk.Dataset(rltk.JsonLinesReader('resources/scholar.jl'), record_class=Scholar)

for r_scholar in ds_scholar.head():
    print(r_scholar.id, r_scholar.date, r_scholar.names)
ek26aiEheesJ 2018-12-26 ['M Fernandez', 'J Kang', 'A Levy', 'D Suciu']
rmtEGXAXHKIJ 2018-12-29 ['S Adali', 'KS Candan', 'Y Papakonstantinou', 'VS']
D0z0BDnbnFcJ 2018-12-27 ['S Christodoulakis']
noTo81QxmHQJ 2018-12-28 ['ACMS Anthology', 'P Edition']
l0W27c1C3NwJ 2018-12-28 ['W Litwin', 'MA Neimat', 'DA Schneider']
IkNOhDqEY18J 2018-12-26 ['S Acharya', 'PB Gibbons']
6QZGeKna5lgJ 2018-12-23 ['T Gri']
XFCkL9QhTjIJ 2018-12-25 ['K Koperski', 'J Han']
9Wo54Wyh_X8J 2018-12-23 ['H Garcia-Molina', 'S Raghavan']
9uxj2XzGt9UJ 2018-12-28 ['M Flickner', 'H Sawhney', 'W Niblack', 'J Ashley', 'Q']

The cached_property decorator means the property value is pre-computed while the Dataset is generated. Caching is especially useful when computing the property is time-consuming (e.g., tokenization, vectorization). remove_raw_object releases the space used by raw_object once all properties have been cached.
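
The idea of computing a property once and then dropping the raw data can be illustrated with the standard library alone. The sketch below is only an analogy: it uses Python's `functools.cached_property`, which is lazy (the value is computed on first access) rather than pre-computed as described for RLTK, and the `Paper` class and its `calls` counter are hypothetical names introduced for the illustration.

```python
from functools import cached_property

class Paper:
    """Hypothetical stand-in for a record class; not an RLTK class."""

    def __init__(self, raw_object):
        self.raw_object = raw_object
        self.calls = 0  # counts how often the transformation actually runs

    @cached_property
    def names(self):
        self.calls += 1
        return [n.strip() for n in self.raw_object['names'].split(',')]

paper = Paper({'names': 'K Koperski, J Han'})
first = paper.names   # computed here and stored on the instance
second = paper.names  # served from the cache, no recomputation

# Once the value is cached, the raw data is no longer needed to read it.
del paper.raw_object
print(paper.names, paper.calls)
```

The transformation runs a single time, and the cached value remains readable after the raw dict is gone, which is the space saving that remove_raw_object is aiming for.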

If you prefer to do data cleaning and manipulation in a pandas.DataFrame, you can easily build a Dataset from it.

[8]:
# do data transformation in df_scholar first, then:

class Scholar2(rltk.AutoGeneratedRecord):
    pass

ds_scholar2 = rltk.Dataset(rltk.DataFrameReader(df_scholar), record_class=Scholar2)

for r_scholar2 in ds_scholar2.head():
    print(r_scholar2.id, r_scholar2.date, r_scholar2.names)
ek26aiEheesJ 26, Dec 2018 M Fernandez, J Kang, A Levy, D Suciu
rmtEGXAXHKIJ 29, Dec 2018 S Adali, KS Candan, Y Papakonstantinou, VS
D0z0BDnbnFcJ 27, Dec 2018 S Christodoulakis
noTo81QxmHQJ 28, Dec 2018 ACMS Anthology, P Edition
l0W27c1C3NwJ 28, Dec 2018 W Litwin, MA Neimat, DA Schneider
IkNOhDqEY18J 26, Dec 2018 S Acharya, PB Gibbons
6QZGeKna5lgJ 23, Dec 2018 T Gri
XFCkL9QhTjIJ 25, Dec 2018 K Koperski, J Han
9Wo54Wyh_X8J 23, Dec 2018 H Garcia-Molina, S Raghavan
9uxj2XzGt9UJ 28, Dec 2018 M Flickner, H Sawhney, W Niblack, J Ashley, Q

Blocking

Blocking eliminates obviously impossible pairs and thereby greatly reduces unnecessary comparisons.

In this example, date is an ideal key for blocking.

[9]:
bg = rltk.HashBlockGenerator()
block = bg.generate(
    bg.block(ds_dblp, property_='date'),
    bg.block(ds_scholar, property_='date')
)

If you want to know what’s in each block, aggregated by key, you can iterate over the key_set_adapter of the block object. Blocks are stored in a concrete KeySetAdapter (the default is MemoryKeySetAdapter).

[10]:
for idx, b in enumerate(block.key_set_adapter):
    if idx == 5: break
    print(b)
('2018-12-24', {('Scholar', 'BTalXWt3faUJ'), ('Scholar', 'bTYTn8VG5hIJ'), ('Scholar', 'sHJ914nPZtUJ'), ('DBLP', 'conf/sigmod/CherniackZ96'), ('Scholar', 'c9Humx2-EMgJ'), ('Scholar', 'YMcmy4FOXi8J'), ('Scholar', 'W1IcM8IUwAEJ'), ('DBLP', 'journals/sigmod/Yang94'), ('Scholar', 'wLNJcNvsulkJ'), ('DBLP', 'journals/sigmod/BohmR94'), ('DBLP', 'conf/sigmod/SimmenSM96'), ('DBLP', 'conf/sigmod/TatarinovIHW01'), ('DBLP', 'journals/vldb/BarbaraI95'), ('DBLP', 'conf/vldb/RohmBSS02'), ('Scholar', 'XVP8s4K0Bg4J'), ('Scholar', 'jfkafZcMjgIJ'), ('DBLP', 'conf/vldb/CosleyLP02'), ('DBLP', 'journals/sigmod/HummerLW02')})
('2018-12-22', {('DBLP', 'journals/sigmod/SilberschatzSU96'), ('Scholar', 'ckrgSn0vBOMJ'), ('Scholar', 'cIJQ0qxrkMIJ'), ('Scholar', 'ZnWLup8HMkUJ'), ('DBLP', 'journals/tods/StolboushkinT98'), ('Scholar', '-iaSLKFHwUkJ'), ('DBLP', 'journals/tods/FernandezKSMT02'), ('Scholar', 'soiN2U4tXykJ'), ('Scholar', 'x4HkJDEYFmYJ'), ('DBLP', 'journals/tods/FranklinCL97'), ('DBLP', 'conf/vldb/AgrawalS94'), ('DBLP', 'conf/sigmod/GibbonsM98')})
('2018-12-26', {('DBLP', 'conf/vldb/RothS97'), ('DBLP', 'journals/sigmod/DogacDKOONEHAKKM95'), ('DBLP', 'conf/sigmod/ZhangDWEMPMDR03'), ('Scholar', 'ek26aiEheesJ'), ('Scholar', '1hkVjoUg8hUJ'), ('Scholar', 'F2ecYx97F2sJ'), ('DBLP', 'journals/vldb/LiR99'), ('Scholar', 'rDObsYKVroMJ'), ('DBLP', 'conf/sigmod/AcharyaGPR99a'), ('Scholar', 'IkNOhDqEY18J'), ('Scholar', 'fXziEl_Htv8J'), ('Scholar', 'LxyVmHubIfUJ'), ('DBLP', 'conf/sigmod/FernandezFKLS97'), ('Scholar', 'qwjRkZuiMHsJ'), ('DBLP', 'conf/sigmod/NybergBCGL94'), ('DBLP', 'conf/sigmod/BreunigKKS01'), ('DBLP', 'conf/sigmod/LometW98'), ('DBLP', 'conf/vldb/MedianoCD94'), ('Scholar', 'Ko9e8CH2Si4J'), ('Scholar', 'DwwSuaisX5QJ'), ('Scholar', 'oAO74aolStoJ'), ('Scholar', 'jXvsW6VxbMYJ'), ('DBLP', 'conf/vldb/Brin95')})
('2018-12-29', {('Scholar', 'Ph7ZpmdNOPEJ'), ('Scholar', 'OmYc0wE1j4kJ'), ('DBLP', 'journals/sigmod/SouzaS99'), ('Scholar', 'rmtEGXAXHKIJ'), ('Scholar', '3M_0Kd8NNjgJ'), ('DBLP', 'conf/vldb/ChakravarthyKAK94'), ('DBLP', 'conf/sigmod/AdaliCPS96'), ('Scholar', 'tbZ0J3HLI18J'), ('DBLP', 'journals/sigmod/KappelR98'), ('DBLP', 'conf/vldb/MeccaCM01')})
('2018-12-25', {('Scholar', 'f1wgD54UUKwJ'), ('Scholar', 'RusJdYPDgQ4J'), ('Scholar', 'zkbTv93Zp1UJ'), ('Scholar', 'S8x6zjXc9oAJ'), ('Scholar', '0aJOXauNqYIJ'), ('Scholar', 'XFCkL9QhTjIJ'), ('DBLP', 'conf/vldb/SistlaYH94'), ('DBLP', 'conf/sigmod/HanKS97'), ('Scholar', 'xF8s5N7oUIMJ'), ('DBLP', 'journals/sigmod/FlorescuLM98'), ('Scholar', '_jl3bN2QlE4J'), ('Scholar', '0HlMHEPJRH4J'), ('DBLP', 'conf/sigmod/MamoulisP99'), ('DBLP', 'conf/sigmod/TatarinovVBSSZ02'), ('DBLP', 'conf/vldb/DeutschPT99'), ('DBLP', 'journals/tods/CliffordDIJS97'), ('DBLP', 'journals/vldb/HarrisR96'), ('DBLP', 'conf/sigmod/HuangSW94')})
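
To see why blocking on date reduces the number of comparisons, here is a small sketch in plain Python. It does not use RLTK and is not how HashBlockGenerator is implemented; the function `block_by_key` is a hypothetical helper, and the six records are a hand-picked subset of the samples printed earlier.

```python
from collections import defaultdict
from itertools import product

# Toy records as (id, date) tuples, taken from the sample output above.
dblp = [('conf/vldb/Brin95', '2018-12-26'),
        ('conf/vldb/AgrawalS94', '2018-12-22'),
        ('conf/vldb/SistlaYH94', '2018-12-25')]
scholar = [('ek26aiEheesJ', '2018-12-26'),
           ('IkNOhDqEY18J', '2018-12-26'),
           ('XFCkL9QhTjIJ', '2018-12-25')]

def block_by_key(records):
    """Group record ids by their blocking key (here, the date)."""
    blocks = defaultdict(list)
    for record_id, key in records:
        blocks[key].append(record_id)
    return blocks

dblp_blocks = block_by_key(dblp)
scholar_blocks = block_by_key(scholar)

# Only records that share a key are paired; keys present in just one
# dataset produce no candidates at all.
shared_keys = sorted(dblp_blocks.keys() & scholar_blocks.keys())
candidates = [pair
              for key in shared_keys
              for pair in product(dblp_blocks[key], scholar_blocks[key])]

all_pairs = len(dblp) * len(scholar)
print(len(candidates), 'candidate pairs instead of', all_pairs)
```

In this toy case only 3 of the 9 possible pairs survive, because a record is compared only with records from the other dataset that share its date.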

Pairwise comparison

Now let’s find the real pairs among all candidate pairs.

First, you need to decide how to measure the similarity of two records.

[11]:
def is_pair(r1, r2):
    for n1, n2 in zip(sorted(r1.names), sorted(r2.names)):
        if rltk.levenshtein_distance(n1, n2) > min(len(n1), len(n2)) / 3:
            return False
    return True
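
The threshold in is_pair allows an edit distance of at most one third of the shorter name's length. The sketch below works through that arithmetic on a few names from the datasets. It uses a textbook dynamic-programming Levenshtein function as a stand-in for rltk.levenshtein_distance; `levenshtein` and `within_threshold` are illustrative names, not RLTK functions.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def within_threshold(n1, n2):
    """Same test as in is_pair, expressed as 'the names are close enough'."""
    return levenshtein(n1, n2) <= min(len(n1), len(n2)) / 3

# 'R Srikant' vs 'R Sfikant': distance 1, limit 9 / 3 = 3.0, so accepted.
print(levenshtein('R Srikant', 'R Sfikant'), within_threshold('R Srikant', 'R Sfikant'))
# 'A Levy' vs 'F Levy': distance 1, limit 6 / 3 = 2.0, so accepted as well.
print(levenshtein('A Levy', 'F Levy'), within_threshold('A Levy', 'F Levy'))
# 'S Brin' vs 'M Mediano': lengths differ by 3, limit is 2.0, so rejected.
print(within_threshold('S Brin', 'M Mediano'))
```

Short names leave little room: with six characters the limit is 2, so differing initials such as 'A Levy' and 'F Levy' still pass the test.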

Then compare all candidate pairs (generated within blocks).

[12]:
for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):
    if is_pair(r_dblp, r_scholar):
        print(r_dblp.names, r_scholar.names)
['W Hümmer', 'W Lehner', 'H Wedekind'] ['W Huemmer', 'W Lehner', 'H Wedekind']
['R Agrawal', 'R Srikant'] ['R Sfikant', 'R Agrawal']
['S Brin'] ['S Brin']
['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim'] ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'SK Kim']
['A Sistla', 'C Yu', 'R Haddad'] ['AP Sistla', 'CT Yu', 'R Haddad']
['E Pourabbas', 'M Rafanelli'] ['E Pourabbas', 'M Rafanelli']
['S Melnik', 'E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']
['S Melnik', 'E Rahm', 'P Bernstein'] ['E Rahm']
['L Libkin'] ['L Libkin']
['I Tatarinov', 'S Viglas', 'K Beyer', 'J Shanmugasundaram', 'E Shekita', 'C Zhang'] ['X Zhang']
['J Gray', 'G Graefe'] ['J Gray', 'G Graefe']
['D Florescu', 'A Levy', 'A Mendelzon'] ['F Levy']
['G Kappel', 'W Retschitzegger'] ['G Kappel', 'W Retschitzegger']
['I Tatarinov', 'Z Ives', 'A Halevy', 'D Weld'] ['I Tatarinov', 'ZG Ives', 'AY Halevy', 'DS Weld']
['A Silberschatz', 'M Stonebraker', 'J Ullman'] ['A Silberschatz', 'M Stonebraker', 'J Ullman']
['R Baeza-Yates', 'G Navarro'] ['R Baeza-Yates', 'G Navarro']
['P Buneman', 'L Raschid', 'J Ullman'] ['P Buneman', 'L Raschid', 'JD Ullman']
['K Böhm', 'T Rakow'] ['K Bohme', 'TC Rakow']
['H Darwen', 'C Date'] ['H Darwen', 'CJ Date']
['M Lee', 'M Kitsuregawa', 'B Ooi', 'K Tan', 'A Mondal'] ['ML Lee', 'M Kitsuregawa', 'BC Ooi', 'KL Tan', 'A Mondal']
['N Mamoulis', 'D Papadias'] ['N Mamoulis', 'D Papadias']
['S Acharya', 'P Gibbons', 'V Poosala', 'S Ramaswamy'] ['S Acharya', 'PB Gibbons']
['L Yang'] ['L Yang']
['G Manku', 'S Rajagopalan', 'B Lindsay'] ['GS Manku', 'S Rajagopalan', 'BG Lindsay']
['P Brown'] ['P Brown']
['D Lomet', 'G Weikum'] ['D Lomet', 'G Weikum']
['S Berchtold', 'D Keim'] ['S Berchtold', 'DA Keim']
['P Gibbons', 'Y Matias'] ['PB Gibbons', 'Y Matias']
['J Hellerstein', 'P Haas', 'H Wang'] ['JM Hellerstein', 'JP Haas', 'HJ Wang']
['J Hellerstein', 'P Haas', 'H Wang'] ['L Yang']
['B Adelberg', 'H Garcia-Molina', 'J Widom'] ['B Adelberg', 'H Garcia-Molina', 'J Widom']
['J Han', 'K Koperski', 'N Stefanovic'] ['K Koperski', 'J Han']
['D Simmen', 'E Shekita', 'T Malkemus'] ['DE Simmen', 'EJ Shekita', 'T Malkemus']
['M Fernandez', 'D Florescu', 'J Kang', 'A Levy', 'D Suciu'] ['F Levy']
['A Deutsch', 'L Popa', 'V Tannen'] ['A Deutsch', 'L Popa', 'V Tannen']
['K Mogi', 'M Kitsuregawa'] ['K Mogi', 'M Kitsuregawa']
['J Shanmugasundaram', 'K Tufte', 'C Zhang', 'G He', 'D DeWitt', 'J Naughton'] ['X Zhang']
['P Hung', 'H Yeung', 'K Karlapalem'] ['PCK Hung', 'HP Yeung', 'K Karlapalem']
['R Srikant', 'R Agrawal'] ['R Sfikant', 'R Agrawal']
['M Cherniack', 'S Zdonik'] ['M Chemiack', 'S Zdonik']
['G Gardarin', 'F Machuca', 'P Pucheral'] ['G Gardarin', 'F Machuca']
['T Griffin', 'L Libkin'] ['L Libkin']
['M Roth', 'P Schwarz'] ['PM Schwarz', 'MT Roth']
['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['F Levy']
['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['S Dar', 'HV Jagadish', 'AY Levy', 'D Srivastava']
['M Carey', 'D DeWitt'] ['MJ Carey', 'DJ DeWitt']
['K Sagonas', 'T Swift', 'D Warren'] ['K Sagonas', 'T Swift', 'DS Warren']
['V Raghavan'] ['V ay Raghavan']
['X Wang', 'M Cherniack'] ['X Wang', 'M Cherniack']
['M Petrovic', 'I Burcea', 'H Jacobsen'] ['M Petrovic', 'I Burcea', 'HA Jacobsen']
['S Raghavan', 'H Garcia-Molina'] ['H Garcia-Molina', 'S Raghavan']
['D Cosley', 'S Lawrence', 'D Pennock'] ['D Cosley', 'S Lawrence', 'DM Pennock']
['K Goldman', 'N Lynch'] ['KJ Goldman', 'N Lynch']
['S Guo', 'W Sun', 'M Weiss'] ['S Guo', 'W Sun', 'MA Weiss']
['W Litwin', 'M Neimat', 'D Schneider'] ['W Litwin', 'MA Neimat', 'DA Schneider']
['A Stolboushkin', 'M Taitslin'] ['AP Stolboushkin', 'MA Taitslin']
['V Verykios', 'G Moustakides', 'M Elfeky'] ['VS Verykios', 'GV Moustakides', 'MG Elfeky']
['C Lee', 'C Shih', 'Y Chen'] ['C Lee', 'CS Shih', 'YH Chen']
['E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']
['E Rahm', 'P Bernstein'] ['E Rahm']
['S Sarawagi'] ['S Sarawagi']
['E Harris', 'K Ramamohanarao'] ['EP Harris', 'K Ramamohanarao']
['D Barbará', 'T Imielinski'] ['D Barbara', 'T Imielinski']
['A Dan', 'P Yu', 'J Chung'] ['A Dan', 'PS Yu', 'JY Chung']
['B Hammond'] ['B Hammond']
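
Some printed pairs have name lists of different lengths, for example ['S Melnik', 'E Rahm', 'P Bernstein'] and ['E Rahm']. This follows from how is_pair uses zip, which stops at the end of the shorter list, so only the leading names of the sorted lists are ever compared. A short sketch of that behavior, using one pair from the output above:

```python
dblp_names = ['S Melnik', 'E Rahm', 'P Bernstein']
scholar_names = ['E Rahm']

# zip() truncates to the shorter input, so two of the three DBLP names
# are never compared against anything.
compared = list(zip(sorted(dblp_names), sorted(scholar_names)))
print(compared)

# A hypothetical extra check (not part of the tutorial's is_pair) that
# would flag this situation before comparing names.
same_length = len(dblp_names) == len(scholar_names)
print(same_length)
```

Whether to require equal lengths is a design choice: it would reject pairs like this one, but it would also reject pairs where one source simply lists fewer authors.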

Evaluation

How do you measure the performance of the strategy you use? Evaluation is a built-in module for benchmarking.

The first step is to label data to get ground truth.

[13]:
gt = rltk.GroundTruth()
with open('resources/dblp_scholar_gt.csv') as f:
    for d in rltk.CSVReader(f):  # this can be replaced with Python's csv reader
        gt.add_positive(d['idDBLP'], d['idScholar'])
gt.generate_all_negatives(ds_dblp, ds_scholar, range_in_gt=True)

A Trial records all the results for further evaluation. It needs an associated GroundTruth.

[14]:
trial = rltk.Trial(gt)
for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):
    if is_pair(r_dblp, r_scholar):
        trial.add_positive(r_dblp, r_scholar)
    else:
        trial.add_negative(r_dblp, r_scholar)
trial.evaluate()
print('precision:', trial.precision, 'recall:', trial.recall, 'f-measure:', trial.f_measure)
print('tp:', len(trial.true_positives_list))
print('fp:', len(trial.false_positives_list))
print('tn:', len(trial.true_negatives_list))
print('fn:', len(trial.false_negatives_list))
precision: 0.8615384615384616 recall: 0.5894736842105263 f-measure: 0.7
tp: 56
fp: 9
tn: 8824
fn: 39
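
The reported scores can be reproduced from the four counts with the standard definitions of precision, recall and F-measure. The sketch below recomputes them by hand; it is independent of RLTK, and the variable names are chosen for the illustration.

```python
# Counts taken from the output above.
tp, fp, tn, fn = 56, 9, 8824, 39

precision = tp / (tp + fp)  # share of predicted pairs that are correct
recall = tp / (tp + fn)     # share of true pairs that were found
f_measure = 2 * precision * recall / (precision + recall)

print('precision:', precision)
print('recall:', recall)
print('f-measure:', f_measure)
```

With 56 of 65 predicted pairs correct, precision is about 0.86, while the 39 missed pairs pull recall down to about 0.59. The true negatives do not enter any of the three scores.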