Step-by-step example
This tutorial walks through an example end to end. The main steps are:
- Dataset analysis
- Construct RLTK datasets
- Blocking
- Pairwise comparison
- Evaluation
Dataset analysis
The dataset used here is an artificial dataset constructed from DBLP and Scholar data. Let’s take a look.
[1]:
# initialization
import os
from datetime import datetime
import pandas as pd
from IPython.display import display
import rltk
[2]:
df_dblp = pd.read_csv('resources/dblp.csv', parse_dates=False)
df_dblp.head()
[2]:
| | id | names | date |
|---|---|---|---|
| 0 | journals/sigmod/HummerLW02 | W Hümmer, W Lehner, H Wedekind | 2018-12-24 |
| 1 | conf/vldb/AgrawalS94 | R Agrawal, R Srikant | 2018-12-22 |
| 2 | conf/vldb/Brin95 | S Brin | 2018-12-26 |
| 3 | conf/vldb/ChakravarthyKAK94 | S Chakravarthy, V Krishnaprasad, E Anwar, S Kim | 2018-12-29 |
| 4 | conf/vldb/MedianoCD94 | M Mediano, M Casanova, M Dreux | 2018-12-26 |
[3]:
df_scholar = pd.read_json('resources/scholar.jl', lines=True, convert_dates=False)
df_scholar.head()
[3]:
| | date | id | names |
|---|---|---|---|
| 0 | 26, Dec 2018 | ek26aiEheesJ | M Fernandez, J Kang, A Levy, D Suciu |
| 1 | 29, Dec 2018 | rmtEGXAXHKIJ | S Adali, KS Candan, Y Papakonstantinou, VS |
| 2 | 27, Dec 2018 | D0z0BDnbnFcJ | S Christodoulakis |
| 3 | 28, Dec 2018 | noTo81QxmHQJ | ACMS Anthology, P Edition |
| 4 | 28, Dec 2018 | l0W27c1C3NwJ | W Litwin, MA Neimat, DA Schneider |
A quick glance shows that both datasets have id, date and names columns. Two things stand out:
- The dates use different formats.
- Each names column holds several names separated by commas.
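The two observations above indicate what each record class will have to normalize. As a minimal sketch using only the standard library (the helper names `normalize_scholar_date` and `split_names` are illustrative and not part of RLTK), a Scholar-style date can be converted to the ISO format that DBLP uses, and a names string can be split into a list. Note that `%b` is locale-dependent; the sketch assumes English month abbreviations.

```python
from datetime import datetime


def normalize_scholar_date(raw):
    """Convert a Scholar-style date such as '26, Dec 2018' to '2018-12-26'."""
    return datetime.strptime(raw, '%d, %b %Y').strftime('%Y-%m-%d')


def split_names(raw):
    """Split a comma-separated names string and trim surrounding whitespace."""
    return [name.strip() for name in raw.split(',')]


print(normalize_scholar_date('26, Dec 2018'))
print(split_names('M Fernandez, J Kang, A Levy, D Suciu'))
```

Once both sources are normalized this way, the date strings can be compared directly and the names can be compared one by one.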
Construct RLTK datasets
In RLTK, a data collection is called a Dataset and each “row” is a Record instance. To construct a Dataset, you read data from a source with a specific Reader. Each row is then exposed as a Python dict, raw_object, which is used to construct a Record instance according to the schema (a concrete subclass of Record) that you define.
For DBLP:
[4]:
class DBLP(rltk.Record):
    @property
    def id(self):
        return self.raw_object['id']
    @property
    def date(self):
        return self.raw_object['date']
    @property
    def names(self):
        return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))
[5]:
ds_dblp = rltk.Dataset(rltk.CSVReader('resources/dblp.csv'), record_class=DBLP)
for r_dblp in ds_dblp.head():
    print(r_dblp.id, r_dblp.date, r_dblp.names)
journals/sigmod/HummerLW02 2018-12-24 ['W Hümmer', 'W Lehner', 'H Wedekind']
conf/vldb/AgrawalS94 2018-12-22 ['R Agrawal', 'R Srikant']
conf/vldb/Brin95 2018-12-26 ['S Brin']
conf/vldb/ChakravarthyKAK94 2018-12-29 ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim']
conf/vldb/MedianoCD94 2018-12-26 ['M Mediano', 'M Casanova', 'M Dreux']
conf/vldb/SistlaYH94 2018-12-25 ['A Sistla', 'C Yu', 'R Haddad']
journals/sigmod/PourabbasR00 2018-12-20 ['E Pourabbas', 'M Rafanelli']
conf/sigmod/MelnikRB03 2018-12-21 ['S Melnik', 'E Rahm', 'P Bernstein']
conf/sigmod/ZhangDWEMPMDR03 2018-12-26 ['X Zhang', 'K Dimitrova', 'L Wang', 'M El-Sayed', 'B Murphy', 'B Pielech', 'M Mulchandani', 'L Ding', 'E Rundensteiner']
conf/sigmod/ZhouWGGZWXYF03 2018-12-27 ['A Zhou', 'Q Wang', 'Z Guo', 'X Gong', 'S Zheng', 'H Wu', 'J Xiao', 'K Yue', 'W Fan']
For scholar:
[6]:
@rltk.remove_raw_object
class Scholar(rltk.Record):
    @rltk.cached_property
    def id(self):
        return self.raw_object['id']
    @rltk.cached_property
    def date(self):
        return datetime.strptime(self.raw_object['date'], '%d, %b %Y').date().strftime('%Y-%m-%d')
    @rltk.cached_property
    def names(self):
        return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))
[7]:
ds_scholar = rltk.Dataset(rltk.JsonLinesReader('resources/scholar.jl'), record_class=Scholar)
for r_scholar in ds_scholar.head():
    print(r_scholar.id, r_scholar.date, r_scholar.names)
ek26aiEheesJ 2018-12-26 ['M Fernandez', 'J Kang', 'A Levy', 'D Suciu']
rmtEGXAXHKIJ 2018-12-29 ['S Adali', 'KS Candan', 'Y Papakonstantinou', 'VS']
D0z0BDnbnFcJ 2018-12-27 ['S Christodoulakis']
noTo81QxmHQJ 2018-12-28 ['ACMS Anthology', 'P Edition']
l0W27c1C3NwJ 2018-12-28 ['W Litwin', 'MA Neimat', 'DA Schneider']
IkNOhDqEY18J 2018-12-26 ['S Acharya', 'PB Gibbons']
6QZGeKna5lgJ 2018-12-23 ['T Gri']
XFCkL9QhTjIJ 2018-12-25 ['K Koperski', 'J Han']
9Wo54Wyh_X8J 2018-12-23 ['H Garcia-Molina', 'S Raghavan']
9uxj2XzGt9UJ 2018-12-28 ['M Flickner', 'H Sawhney', 'W Niblack', 'J Ashley', 'Q']
The cached_property decorator means the property value is pre-computed while the Dataset is generated. This is especially useful when computing the property is time-consuming (e.g., tokenization, vectorization). remove_raw_object releases the memory held by raw_object after all properties have been cached.
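The idea of caching a derived property and then dropping the raw data can be illustrated without RLTK. The sketch below uses `functools.cached_property` from the standard library purely as an analogy: it computes the value lazily on first access, whereas RLTK is described above as computing it while the Dataset is generated. The class `ScholarSketch` and its `split_calls` counter are hypothetical.

```python
from functools import cached_property


class ScholarSketch:
    """Hypothetical stand-in for a record; this is not an rltk.Record."""

    def __init__(self, raw_object):
        self.raw_object = raw_object
        self.split_calls = 0  # counts how often the transformation runs

    @cached_property
    def names(self):
        self.split_calls += 1
        return [n.strip() for n in self.raw_object['names'].split(',')]


record = ScholarSketch({'names': 'K Koperski, J Han'})
first = record.names
second = record.names  # served from the cache, no recomputation
print(first, record.split_calls)

# Once the value is cached, the raw data is no longer needed to read it.
del record.raw_object
print(record.names)
```

The second access returns the stored list, and the property stays readable after the raw dict is deleted, which is the memory saving that remove_raw_object aims for.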
If you prefer to do data cleaning and manipulation in a pandas.DataFrame, you can easily build a Dataset from it.
[8]:
# do the data transformation in df_scholar first, then:
class Scholar2(rltk.AutoGeneratedRecord):
    pass
ds_scholar2 = rltk.Dataset(rltk.DataFrameReader(df_scholar), record_class=Scholar2)
for r_scholar2 in ds_scholar2.head():
    print(r_scholar2.id, r_scholar2.date, r_scholar2.names)
ek26aiEheesJ 26, Dec 2018 M Fernandez, J Kang, A Levy, D Suciu
rmtEGXAXHKIJ 29, Dec 2018 S Adali, KS Candan, Y Papakonstantinou, VS
D0z0BDnbnFcJ 27, Dec 2018 S Christodoulakis
noTo81QxmHQJ 28, Dec 2018 ACMS Anthology, P Edition
l0W27c1C3NwJ 28, Dec 2018 W Litwin, MA Neimat, DA Schneider
IkNOhDqEY18J 26, Dec 2018 S Acharya, PB Gibbons
6QZGeKna5lgJ 23, Dec 2018 T Gri
XFCkL9QhTjIJ 25, Dec 2018 K Koperski, J Han
9Wo54Wyh_X8J 23, Dec 2018 H Garcia-Molina, S Raghavan
9uxj2XzGt9UJ 28, Dec 2018 M Flickner, H Sawhney, W Niblack, J Ashley, Q
Blocking
Blocking eliminates obviously impossible pairs and so greatly reduces the number of unnecessary comparisons.
In this example, date is an ideal blocking key.
[9]:
bg = rltk.HashBlockGenerator()
block = bg.generate(
    bg.block(ds_dblp, property_='date'),
    bg.block(ds_scholar, property_='date')
)
To see what is in each block, grouped by key, iterate over the key_set_adapter of the block object. Blocks are stored in a concrete KeySetAdapter (MemoryKeySetAdapter by default).
[10]:
for idx, b in enumerate(block.key_set_adapter):
    if idx == 5:
        break
    print(b)
('2018-12-24', {('Scholar', 'BTalXWt3faUJ'), ('Scholar', 'bTYTn8VG5hIJ'), ('Scholar', 'sHJ914nPZtUJ'), ('DBLP', 'conf/sigmod/CherniackZ96'), ('Scholar', 'c9Humx2-EMgJ'), ('Scholar', 'YMcmy4FOXi8J'), ('Scholar', 'W1IcM8IUwAEJ'), ('DBLP', 'journals/sigmod/Yang94'), ('Scholar', 'wLNJcNvsulkJ'), ('DBLP', 'journals/sigmod/BohmR94'), ('DBLP', 'conf/sigmod/SimmenSM96'), ('DBLP', 'conf/sigmod/TatarinovIHW01'), ('DBLP', 'journals/vldb/BarbaraI95'), ('DBLP', 'conf/vldb/RohmBSS02'), ('Scholar', 'XVP8s4K0Bg4J'), ('Scholar', 'jfkafZcMjgIJ'), ('DBLP', 'conf/vldb/CosleyLP02'), ('DBLP', 'journals/sigmod/HummerLW02')})
('2018-12-22', {('DBLP', 'journals/sigmod/SilberschatzSU96'), ('Scholar', 'ckrgSn0vBOMJ'), ('Scholar', 'cIJQ0qxrkMIJ'), ('Scholar', 'ZnWLup8HMkUJ'), ('DBLP', 'journals/tods/StolboushkinT98'), ('Scholar', '-iaSLKFHwUkJ'), ('DBLP', 'journals/tods/FernandezKSMT02'), ('Scholar', 'soiN2U4tXykJ'), ('Scholar', 'x4HkJDEYFmYJ'), ('DBLP', 'journals/tods/FranklinCL97'), ('DBLP', 'conf/vldb/AgrawalS94'), ('DBLP', 'conf/sigmod/GibbonsM98')})
('2018-12-26', {('DBLP', 'conf/vldb/RothS97'), ('DBLP', 'journals/sigmod/DogacDKOONEHAKKM95'), ('DBLP', 'conf/sigmod/ZhangDWEMPMDR03'), ('Scholar', 'ek26aiEheesJ'), ('Scholar', '1hkVjoUg8hUJ'), ('Scholar', 'F2ecYx97F2sJ'), ('DBLP', 'journals/vldb/LiR99'), ('Scholar', 'rDObsYKVroMJ'), ('DBLP', 'conf/sigmod/AcharyaGPR99a'), ('Scholar', 'IkNOhDqEY18J'), ('Scholar', 'fXziEl_Htv8J'), ('Scholar', 'LxyVmHubIfUJ'), ('DBLP', 'conf/sigmod/FernandezFKLS97'), ('Scholar', 'qwjRkZuiMHsJ'), ('DBLP', 'conf/sigmod/NybergBCGL94'), ('DBLP', 'conf/sigmod/BreunigKKS01'), ('DBLP', 'conf/sigmod/LometW98'), ('DBLP', 'conf/vldb/MedianoCD94'), ('Scholar', 'Ko9e8CH2Si4J'), ('Scholar', 'DwwSuaisX5QJ'), ('Scholar', 'oAO74aolStoJ'), ('Scholar', 'jXvsW6VxbMYJ'), ('DBLP', 'conf/vldb/Brin95')})
('2018-12-29', {('Scholar', 'Ph7ZpmdNOPEJ'), ('Scholar', 'OmYc0wE1j4kJ'), ('DBLP', 'journals/sigmod/SouzaS99'), ('Scholar', 'rmtEGXAXHKIJ'), ('Scholar', '3M_0Kd8NNjgJ'), ('DBLP', 'conf/vldb/ChakravarthyKAK94'), ('DBLP', 'conf/sigmod/AdaliCPS96'), ('Scholar', 'tbZ0J3HLI18J'), ('DBLP', 'journals/sigmod/KappelR98'), ('DBLP', 'conf/vldb/MeccaCM01')})
('2018-12-25', {('Scholar', 'f1wgD54UUKwJ'), ('Scholar', 'RusJdYPDgQ4J'), ('Scholar', 'zkbTv93Zp1UJ'), ('Scholar', 'S8x6zjXc9oAJ'), ('Scholar', '0aJOXauNqYIJ'), ('Scholar', 'XFCkL9QhTjIJ'), ('DBLP', 'conf/vldb/SistlaYH94'), ('DBLP', 'conf/sigmod/HanKS97'), ('Scholar', 'xF8s5N7oUIMJ'), ('DBLP', 'journals/sigmod/FlorescuLM98'), ('Scholar', '_jl3bN2QlE4J'), ('Scholar', '0HlMHEPJRH4J'), ('DBLP', 'conf/sigmod/MamoulisP99'), ('DBLP', 'conf/sigmod/TatarinovVBSSZ02'), ('DBLP', 'conf/vldb/DeutschPT99'), ('DBLP', 'journals/tods/CliffordDIJS97'), ('DBLP', 'journals/vldb/HarrisR96'), ('DBLP', 'conf/sigmod/HuangSW94')})
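To make the effect of hash blocking concrete, here is a minimal plain-Python sketch that does not use RLTK. It groups a few (id, date) records taken from the outputs above by date and only pairs records that share a key. The function `candidate_pairs` and the tuple layout are assumptions made for illustration, not RLTK internals.

```python
from collections import defaultdict

# A handful of (id, date) records copied from the outputs shown earlier.
dblp = [
    ('journals/sigmod/HummerLW02', '2018-12-24'),
    ('conf/vldb/AgrawalS94', '2018-12-22'),
    ('conf/vldb/Brin95', '2018-12-26'),
    ('conf/vldb/MedianoCD94', '2018-12-26'),
]
scholar = [
    ('ek26aiEheesJ', '2018-12-26'),
    ('rmtEGXAXHKIJ', '2018-12-29'),
    ('D0z0BDnbnFcJ', '2018-12-27'),
    ('IkNOhDqEY18J', '2018-12-26'),
]


def candidate_pairs(left, right):
    """Return id pairs whose records share the same blocking key (date)."""
    right_by_key = defaultdict(list)
    for rid, key in right:
        right_by_key[key].append(rid)
    return [(lid, rid) for lid, key in left for rid in right_by_key.get(key, [])]


pairs = candidate_pairs(dblp, scholar)
print(len(dblp) * len(scholar), 'pairs without blocking')
print(len(pairs), 'pairs with blocking')
```

Only records dated 2018-12-26 appear on both sides of this small sample, so the candidate set shrinks from the full cross product to the pairs inside that single block.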
Pairwise comparison
Now let’s find the real matches among the candidate pairs.
First, you need to decide how to compare two records.
[11]:
def is_pair(r1, r2):
    for n1, n2 in zip(sorted(r1.names), sorted(r2.names)):
        if rltk.levenshtein_distance(n1, n2) > min(len(n1), len(n2)) / 3:
            return False
    return True
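The behaviour of this rule is easier to see on plain lists. The sketch below re-implements it without RLTK: `levenshtein` is a local textbook edit-distance function standing in for `rltk.levenshtein_distance`, and `is_pair_sketch` is a hypothetical name that takes name lists instead of records. It also shows that `zip` stops at the shorter list, so a short author list can match a longer one, which is consistent with output pairs such as ['E Rahm', 'P Bernstein'] and ['E Rahm'].

```python
def levenshtein(a, b):
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_pair_sketch(names1, names2):
    """Same rule as is_pair, applied to plain lists of names."""
    for n1, n2 in zip(sorted(names1), sorted(names2)):
        if levenshtein(n1, n2) > min(len(n1), len(n2)) / 3:
            return False
    return True


# A one-character typo stays within the threshold of a third of the name length.
print(is_pair_sketch(['R Agrawal', 'R Srikant'], ['R Sfikant', 'R Agrawal']))
# zip() truncates to the shorter list, so only 'E Rahm' is compared here.
print(is_pair_sketch(['E Rahm', 'P Bernstein'], ['E Rahm']))
# Clearly different names exceed the threshold.
print(is_pair_sketch(['S Brin'], ['L Yang']))
```

One possible tightening, which this tutorial does not apply, would be to also compare the lengths of the two name lists before accepting a match; that would trade some recall for fewer false positives.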
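Then compare the record pairs. As written, the call below does not pass the block built earlier, so it compares every DBLP record with every Scholar record; to restrict the comparison to candidate pairs generated within blocks, pass the block object through the block argument of get_record_pairs.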
[12]:
for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):
    if is_pair(r_dblp, r_scholar):
        print(r_dblp.names, r_scholar.names)
['W Hümmer', 'W Lehner', 'H Wedekind'] ['W Huemmer', 'W Lehner', 'H Wedekind']
['R Agrawal', 'R Srikant'] ['R Sfikant', 'R Agrawal']
['S Brin'] ['S Brin']
['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim'] ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'SK Kim']
['A Sistla', 'C Yu', 'R Haddad'] ['AP Sistla', 'CT Yu', 'R Haddad']
['E Pourabbas', 'M Rafanelli'] ['E Pourabbas', 'M Rafanelli']
['S Melnik', 'E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']
['S Melnik', 'E Rahm', 'P Bernstein'] ['E Rahm']
['L Libkin'] ['L Libkin']
['I Tatarinov', 'S Viglas', 'K Beyer', 'J Shanmugasundaram', 'E Shekita', 'C Zhang'] ['X Zhang']
['J Gray', 'G Graefe'] ['J Gray', 'G Graefe']
['D Florescu', 'A Levy', 'A Mendelzon'] ['F Levy']
['G Kappel', 'W Retschitzegger'] ['G Kappel', 'W Retschitzegger']
['I Tatarinov', 'Z Ives', 'A Halevy', 'D Weld'] ['I Tatarinov', 'ZG Ives', 'AY Halevy', 'DS Weld']
['A Silberschatz', 'M Stonebraker', 'J Ullman'] ['A Silberschatz', 'M Stonebraker', 'J Ullman']
['R Baeza-Yates', 'G Navarro'] ['R Baeza-Yates', 'G Navarro']
['P Buneman', 'L Raschid', 'J Ullman'] ['P Buneman', 'L Raschid', 'JD Ullman']
['K Böhm', 'T Rakow'] ['K Bohme', 'TC Rakow']
['H Darwen', 'C Date'] ['H Darwen', 'CJ Date']
['M Lee', 'M Kitsuregawa', 'B Ooi', 'K Tan', 'A Mondal'] ['ML Lee', 'M Kitsuregawa', 'BC Ooi', 'KL Tan', 'A Mondal']
['N Mamoulis', 'D Papadias'] ['N Mamoulis', 'D Papadias']
['S Acharya', 'P Gibbons', 'V Poosala', 'S Ramaswamy'] ['S Acharya', 'PB Gibbons']
['L Yang'] ['L Yang']
['G Manku', 'S Rajagopalan', 'B Lindsay'] ['GS Manku', 'S Rajagopalan', 'BG Lindsay']
['P Brown'] ['P Brown']
['D Lomet', 'G Weikum'] ['D Lomet', 'G Weikum']
['S Berchtold', 'D Keim'] ['S Berchtold', 'DA Keim']
['P Gibbons', 'Y Matias'] ['PB Gibbons', 'Y Matias']
['J Hellerstein', 'P Haas', 'H Wang'] ['JM Hellerstein', 'JP Haas', 'HJ Wang']
['J Hellerstein', 'P Haas', 'H Wang'] ['L Yang']
['B Adelberg', 'H Garcia-Molina', 'J Widom'] ['B Adelberg', 'H Garcia-Molina', 'J Widom']
['J Han', 'K Koperski', 'N Stefanovic'] ['K Koperski', 'J Han']
['D Simmen', 'E Shekita', 'T Malkemus'] ['DE Simmen', 'EJ Shekita', 'T Malkemus']
['M Fernandez', 'D Florescu', 'J Kang', 'A Levy', 'D Suciu'] ['F Levy']
['A Deutsch', 'L Popa', 'V Tannen'] ['A Deutsch', 'L Popa', 'V Tannen']
['K Mogi', 'M Kitsuregawa'] ['K Mogi', 'M Kitsuregawa']
['J Shanmugasundaram', 'K Tufte', 'C Zhang', 'G He', 'D DeWitt', 'J Naughton'] ['X Zhang']
['P Hung', 'H Yeung', 'K Karlapalem'] ['PCK Hung', 'HP Yeung', 'K Karlapalem']
['R Srikant', 'R Agrawal'] ['R Sfikant', 'R Agrawal']
['M Cherniack', 'S Zdonik'] ['M Chemiack', 'S Zdonik']
['G Gardarin', 'F Machuca', 'P Pucheral'] ['G Gardarin', 'F Machuca']
['T Griffin', 'L Libkin'] ['L Libkin']
['M Roth', 'P Schwarz'] ['PM Schwarz', 'MT Roth']
['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['F Levy']
['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['S Dar', 'HV Jagadish', 'AY Levy', 'D Srivastava']
['M Carey', 'D DeWitt'] ['MJ Carey', 'DJ DeWitt']
['K Sagonas', 'T Swift', 'D Warren'] ['K Sagonas', 'T Swift', 'DS Warren']
['V Raghavan'] ['V ay Raghavan']
['X Wang', 'M Cherniack'] ['X Wang', 'M Cherniack']
['M Petrovic', 'I Burcea', 'H Jacobsen'] ['M Petrovic', 'I Burcea', 'HA Jacobsen']
['S Raghavan', 'H Garcia-Molina'] ['H Garcia-Molina', 'S Raghavan']
['D Cosley', 'S Lawrence', 'D Pennock'] ['D Cosley', 'S Lawrence', 'DM Pennock']
['K Goldman', 'N Lynch'] ['KJ Goldman', 'N Lynch']
['S Guo', 'W Sun', 'M Weiss'] ['S Guo', 'W Sun', 'MA Weiss']
['W Litwin', 'M Neimat', 'D Schneider'] ['W Litwin', 'MA Neimat', 'DA Schneider']
['A Stolboushkin', 'M Taitslin'] ['AP Stolboushkin', 'MA Taitslin']
['V Verykios', 'G Moustakides', 'M Elfeky'] ['VS Verykios', 'GV Moustakides', 'MG Elfeky']
['C Lee', 'C Shih', 'Y Chen'] ['C Lee', 'CS Shih', 'YH Chen']
['E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']
['E Rahm', 'P Bernstein'] ['E Rahm']
['S Sarawagi'] ['S Sarawagi']
['E Harris', 'K Ramamohanarao'] ['EP Harris', 'K Ramamohanarao']
['D Barbará', 'T Imielinski'] ['D Barbara', 'T Imielinski']
['A Dan', 'P Yu', 'J Chung'] ['A Dan', 'PS Yu', 'JY Chung']
['B Hammond'] ['B Hammond']
Evaluation
How do you know how well your strategy performs? RLTK has a built-in evaluation module for benchmarking.
The first step is to label data to obtain the ground truth.
[13]:
gt = rltk.GroundTruth()
with open('resources/dblp_scholar_gt.csv') as f:
    for d in rltk.CSVReader(f):  # this can be replaced with Python's csv reader
        gt.add_positive(d['idDBLP'], d['idScholar'])
gt.generate_all_negatives(ds_dblp, ds_scholar, range_in_gt=True)
Trial records all the results for further evaluation. It must be associated with a GroundTruth.
[14]:
trial = rltk.Trial(gt)
for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):
    if is_pair(r_dblp, r_scholar):
        trial.add_positive(r_dblp, r_scholar)
    else:
        trial.add_negative(r_dblp, r_scholar)
trial.evaluate()
print('precision:', trial.precision, 'recall:', trial.recall, 'f-measure:', trial.f_measure)
print('tp:', len(trial.true_positives_list))
print('fp:', len(trial.false_positives_list))
print('tn:', len(trial.true_negatives_list))
print('fn:', len(trial.false_negatives_list))
precision: 0.8615384615384616 recall: 0.5894736842105263 f-measure: 0.7
tp: 56
fp: 9
tn: 8824
fn: 39
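The reported scores can be checked by hand from the four counts. The sketch below assumes the standard definitions of precision, recall and F1; that Trial uses exactly these formulas is an assumption, although they are consistent with the values printed above.

```python
# Counts copied from the output above.
tp, fp, tn, fn = 56, 9, 8824, 39

precision = tp / (tp + fp)  # share of predicted matches that are correct
recall = tp / (tp + fn)     # share of true matches that were found
f_measure = 2 * precision * recall / (precision + recall)

print('pairs evaluated:', tp + fp + tn + fn)
print('precision:', precision, 'recall:', recall, 'f-measure:', f_measure)
```

The low recall relative to precision shows that the name-matching rule misses many true matches (39 false negatives) while rarely accepting wrong ones (9 false positives).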