Record Linkage Toolkit Documentation


Record Linkage Toolkit Documentation
Release 0.14
Jonathan de Bruin
Dec 04, 2019

First steps

1 About
    1.1 Introduction
    1.2 What is record linkage?
    1.3 How to link records?
2 Installation guide
    2.1 Python version support
    2.2 Installation
    2.3 Dependencies
3 Link two datasets
    3.1 Introduction
    3.2 Make record pairs
    3.3 Compare records
    3.4 Full code
4 Data deduplication
    4.1 Introduction
    4.2 Make record pairs
    4.3 Compare records
    4.4 Full code
5 0. Preprocessing
    5.1 Cleaning
    5.2 Phonetic encoding
6 1. Indexing
    6.1 recordlinkage.Index object
    6.2 Algorithms
    6.3 User-defined algorithms
    6.4 Examples
7 2. Comparing
    7.1 recordlinkage.Compare object
    7.2 Algorithms
    7.3 User-defined algorithms
    7.4 Examples
8 3. Classification
    8.1 Classifiers
    8.2 Adapters
    8.3 User-defined algorithms
    8.4 Examples
    8.5 Network
9 4. Evaluation
10 Datasets
11 Miscellaneous
12 Annotation
    12.1 Generate annotation file
    12.2 Manual labeling
    12.3 Export/read annotation file
13 Classification algorithms
    13.1 Supervised learning
    13.2 Unsupervised learning
14 Performance
    14.1 Indexing
    14.2 Comparing
15 Contributing
    15.1 Testing
    15.2 Performance
16 Release notes
    16.1 Version 0.14.0
Bibliography
Index

Record Linkage Toolkit Documentation, Release 0.14

All you need to start linking records.

First steps


CHAPTER 1

About

1.1 Introduction

The Python Record Linkage Toolkit is a library to link records in or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records and classifiers. The package is developed for research and the linking of small or medium sized files.

The project is inspired by the Freely Extensible Biomedical Record Linkage (FEBRL) project, which is a great project. In contrast with FEBRL, the recordlinkage project makes extensive use of data manipulation tools like pandas and numpy. The use of pandas, a flexible and powerful data analysis and manipulation library for Python, makes the record linkage process much easier and faster. The extensive pandas library can be used to integrate your record linkage directly into existing data manipulation projects.

One of the aims of this project is to make an extensible record linkage framework. It is easy to include your own indexing algorithms, comparison/similarity measures and classifiers. The main features of the Python Record Linkage Toolkit are:

- Clean and standardise data with easy to use tools.
- Make pairs of records with smart indexing methods such as blocking and sorted neighbourhood indexing.
- Compare records with a large number of comparison and similarity measures for different types of variables such as strings, numbers and dates.
- Several classification algorithms, both supervised and unsupervised.
- Common record linkage evaluation tools.
- Several built-in datasets.

1.2 What is record linkage?

The term record linkage is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Record linkage is used to link data from multiple data sources or to find duplicates in a single data source. In computer science, record linkage is also known as data matching or deduplication (when searching for duplicate records within a single file).

In record linkage, the attributes of the entity (stored in a record) are used to link two or more records. Attributes can be unique entity identifiers (SSN, license plate number), but also attributes like (sur)name, date of birth and car model/colour. The record linkage procedure can be represented as a workflow [Christen, 2012]. The steps are: cleaning, indexing, comparing, classifying and evaluation. If needed, the classified record pairs flow back to improve the previous step. The Python Record Linkage Toolkit follows this workflow.

See also:

Christen, Peter. 2012. Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media.

Fellegi, Ivan P and Alan B Sunter. 1969. “A theory for record linkage.” Journal of the American Statistical Association 64(328):1183–1210.

Dunn, Halbert L. 1946. “Record linkage.” American Journal of Public Health and the Nations Health 36(12):1412–1416.

Herzog, Thomas N, Fritz J Scheuren and William E Winkler. 2007. Data quality and record linkage techniques. Vol. 1. Springer.

1.3 How to link records?

Import the recordlinkage module with all important tools for record linkage and import the data manipulation framework pandas.

    import recordlinkage
    import pandas

Suppose you want to link two datasets with personal information like name, sex and date of birth. Load these datasets into pandas DataFrames.

    df_a = pandas.DataFrame(YOUR_FIRST_DATASET)
    df_b = pandas.DataFrame(YOUR_SECOND_DATASET)

Comparing all records can be computationally intensive. Therefore, we make a smart set of candidate links with one of the built-in indexing techniques like blocking.
Only record pairs agreeing on the surname are included.

    indexer = recordlinkage.Index()
    indexer.block('surname')
    candidate_links = indexer.index(df_a, df_b)

Each candidate link needs to be compared on the comparable attributes. This can be done easily with the Compare class and the available comparison and similarity measures.

    compare = recordlinkage.Compare()

    compare.string('name', 'name', method='jarowinkler', threshold=0.85)
    compare.exact('sex', 'gender')
    compare.exact('dob', 'date_of_birth')
    compare.string('streetname', 'streetname', method='damerau_levenshtein', threshold=0.7)
    compare.exact('place', 'placename')
    compare.exact('haircolor', 'haircolor', missing_value=9)

    # The comparison vectors
    compare_vectors = compare.compute(candidate_links, df_a, df_b)

This record linkage package contains several classification algorithms. Many of the algorithms need training data (supervised learning), while others are unsupervised. An example of supervised learning:

    true_linkage = pandas.Series(YOUR_GOLDEN_DATA, index=pandas.MultiIndex(YOUR_MULTI_INDEX))

    logrg = recordlinkage.LogisticRegressionClassifier()
    logrg.fit(compare_vectors[true_linkage.index], true_linkage)

    logrg.predict(compare_vectors)

and an example of unsupervised learning (the well known ECM-algorithm):

    ecm = recordlinkage.BernoulliEMClassifier()
    ecm.fit_predict(compare_vectors)
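The effect of blocking on the number of candidate links can be illustrated without the toolkit. The following is a minimal plain-Python sketch, not the toolkit's implementation: the helper names full_pairs and block_pairs and the toy records are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from itertools import product


def full_pairs(records_a, records_b):
    """Return every (id_a, id_b) combination, i.e. a full index."""
    return list(product(records_a, records_b))


def block_pairs(records_a, records_b, key):
    """Return only the pairs whose records agree exactly on `key`."""
    # Group the ids of the second dataset by their blocking value.
    groups = defaultdict(list)
    for id_b, rec_b in records_b.items():
        groups[rec_b[key]].append(id_b)
    # Pair each record of the first dataset with its own block only.
    return [
        (id_a, id_b)
        for id_a, rec_a in records_a.items()
        for id_b in groups.get(rec_a[key], [])
    ]


# Hypothetical toy data, not taken from the toolkit.
people_a = {
    "a1": {"surname": "smith", "name": "john"},
    "a2": {"surname": "jones", "name": "mary"},
    "a3": {"surname": "brown", "name": "alan"},
}
people_b = {
    "b1": {"surname": "smith", "name": "jon"},
    "b2": {"surname": "jones", "name": "marie"},
    "b3": {"surname": "smith", "name": "anna"},
}

all_pairs = full_pairs(people_a, people_b)
candidate_pairs = block_pairs(people_a, people_b, "surname")
print(len(all_pairs), len(candidate_pairs))
```

Grouping one dataset by the blocking value first means each record is only paired with records in its own block, so the full set of combinations never has to be built. The price is that a true match with a misspelled surname never becomes a candidate.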


CHAPTER 2

Installation guide

2.1 Python version support

The Python Record Linkage Toolkit supports the versions of Python that Pandas supports as well. You can find the supported Python versions in the Pandas documentation.

2.2 Installation

The Python Record Linkage Toolkit requires Python 3.5 or higher (since version 0.14). Install the package with pip:

    pip install recordlinkage

Python 2.7 users can use version 0.13, but it is advised to use Python 3.5 or higher.

You can also clone the project on GitHub. The license of this record linkage package is BSD-3-Clause.

2.3 Dependencies

The following packages are required. You probably have most of them already ;)

- numpy
- pandas (>=0.18.0)
- scipy
- sklearn
- jellyfish: Needed for approximate string comparison and string encoding.
- numexpr (optional): Used to speed up numeric comparisons.
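To see which of the packages listed above are already available in an environment, a small standard-library check can help. This is only a sketch: it assumes the usual top-level import names (for example sklearn for scikit-learn), and the helper check_modules is hypothetical, not part of the toolkit.

```python
from importlib.util import find_spec

# Import names of the packages listed above; numexpr is optional.
REQUIRED = ["numpy", "pandas", "scipy", "sklearn", "jellyfish"]
OPTIONAL = ["numexpr"]


def check_modules(names):
    """Map each top-level module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


required_status = check_modules(REQUIRED)
optional_status = check_modules(OPTIONAL)

for name, available in {**required_status, **optional_status}.items():
    print(f"{name}: {'found' if available else 'missing'}")
```

The check only reports whether a module can be located. It does not verify version requirements such as the minimum pandas version.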


CHAPTER 3

Link two datasets

3.1 Introduction

This example shows how two datasets with data about persons can be linked. We will try to link the data based on attributes like first name, surname, sex, date of birth, place and address. The data used in this example is part of Febrl and is fictitious.

First, start with importing the recordlinkage module. The submodule recordlinkage.datasets contains several datasets that can be used for testing. For this example, we use the Febrl datasets 4A and 4B. These datasets can be loaded with the function load_febrl4.

[2]: import recordlinkage
     from recordlinkage.datasets import load_febrl4

The datasets are loaded with the following code. The returned datasets are of type pandas.DataFrame. This makes it easy to manipulate the data if desired. For details about data manipulation with pandas, see the comprehensive documentation at http://pandas.pydata.org/.

[3]: dfA, dfB = load_febrl4()
     dfA

[3]: [Output: a DataFrame indexed by rec_id with the columns given_name, surname, street_number, address_1, address_2, suburb, postcode, state, date_of_birth and soc_sec_id; 5000 rows x 10 columns]

3.2 Make record pairs

It is very intuitive to compare each record in DataFrame dfA with all records of DataFrame dfB. In fact, we want to make record pairs. Each record pair should contain one record of dfA and one record of dfB. This process of making record pairs is also called ‘indexing’. With the recordlinkage module, indexing is easy. First, create a recordlinkage.Index object and call its .full method. This object generates a full index on a .index(...) call. In case of deduplication of a single dataframe, one dataframe is sufficient as argument.

[4]: indexer = recordlinkage.Index()
     indexer.full()
     pairs = indexer.index(dfA, dfB)

     WARNING:recordlinkage:indexing - performance warning - A full index can result in large number of record pairs.

With the method index, all possible (and unique) record pairs are made. The method returns a pandas.MultiIndex. The number of pairs is equal to the number of records in dfA times the number of records in dfB.

[5]: print(len(dfA), len(dfB), len(pairs))

     5000 5000 25000000

Many of these record pairs do not belong to the same person. In case of one-to-one matching, the number of matches

should be no more than the number of records in the smallest dataframe. In case of full indexing, min(len(dfA), len(dfB)) is much smaller than len(pairs). The recordlinkage module has some more advanced indexing methods to reduce the number of record pairs. Obvious non-matches are left out of the index. Note that if a matching record pair is not included in the index, it cannot be matched anymore.

One of the most well known indexing methods is named blocking. This method includes only record pairs that are identical on one or more stored attributes of the person (or entity in general). The blocking method can be used in the recordlinkage module.

[6]: indexer = recordlinkage.Index()
     indexer.block('given_name')
     candidate_links = indexer.index(dfA, dfB)
     print(len(candidate_links))

     77249

The argument 'given_name' is the blocking variable. This variable has to be the name of a column in dfA and dfB. It is possible to pass a list of column names to block on multiple variables. Blocking on multiple variables will reduce the number of record pairs even further.

Another implemented indexing method is Sorted Neighbourhood Indexing (recordlinkage.index.SortedNeighbourhood). This method is very useful when there are many misspellings in the strings used for indexing. In fact, sorted neighbourhood indexing is a generalisation of blocking. See the documentation for details about sorted neighbourhood indexing.

3.3 Compare records

Each record pair is a candidate match. To classify the candidate record pairs into matches and non-matches, compare the records on all attributes both records have in common. The recordlinkage module has a class named Compare. This class is used to compare the records.
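Before turning to the toolkit's own classes, the idea of comparing one candidate pair attribute by attribute can be sketched in plain Python. This is an illustration under stated assumptions, not the Compare class: the helpers exact, similar and comparison_vector and the two records are hypothetical, and the standard library's difflib.SequenceMatcher ratio stands in for the Jaro-Winkler measure.

```python
from difflib import SequenceMatcher


def exact(value_a, value_b):
    """Return 1 if both values are identical, else 0."""
    return int(value_a == value_b)


def similar(value_a, value_b, threshold=0.85):
    """Return 1 if the difflib similarity ratio reaches `threshold`, else 0."""
    ratio = SequenceMatcher(None, value_a, value_b).ratio()
    return int(ratio >= threshold)


def comparison_vector(rec_a, rec_b):
    """Compare two records attribute by attribute."""
    return {
        "given_name": exact(rec_a["given_name"], rec_b["given_name"]),
        "surname": similar(rec_a["surname"], rec_b["surname"]),
        "suburb": exact(rec_a["suburb"], rec_b["suburb"]),
    }


# Hypothetical records, invented for this illustration.
rec_a = {"given_name": "anna", "surname": "matthews", "suburb": "richlands"}
rec_b = {"given_name": "anna", "surname": "mathews", "suburb": "dapto"}

vector = comparison_vector(rec_a, rec_b)
print(vector)
```

Each attribute contributes a 0 or a 1, so a misspelled surname can still count as an agreement when its similarity is above the threshold, while an exact comparison would reject it.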
The following code shows how to compare attributes.

[7]: # This cell can take some time to compute.
     compare_cl = recordlinkage.Compare()

     compare_cl.exact('given_name', 'given_name', label='given_name')
     compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
     compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
     compare_cl.exact('suburb', 'suburb', label='suburb')
     compare_cl.exact('state', 'state', label='state')
     compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

     features = compare_cl.compute(candidate_links, dfA, dfB)

The comparison of record pairs starts when the compute method is called. All attribute comparisons are stored in a DataFrame with the features as columns and the record pairs as rows.

[8]: features

[8]: [Output: a DataFrame indexed by the record pairs (rec_id_1, rec_id_2) with the columns given_name, surname, date_of_birth, suburb, state and address_1; 77249 rows x 6 columns]

[9]: [Output: per-column summary statistics of the features, including the 50%, 75% and max rows]

The last step is to decide which records belong to the same person. In this example, we keep it simple:

[10]: # Sum the comparison results.
      features.sum(axis=1).value_counts().sort_index(ascending=False)

[10]: 6.0     1566
      5.0     1332
      4.0      343
      3.0      146
      2.0    16427
      1.0    57435
      dtype: int64
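The simple decision rule used in this example, summing the agreements per record pair and keeping the pairs above a cut-off, can be sketched without pandas. The comparison vectors below are invented for illustration, and the names scores, distribution and THRESHOLD are hypothetical.

```python
from collections import Counter

# Hypothetical comparison vectors keyed by record pair (invented values).
features = {
    ("rec-1-org", "rec-1-dup-0"): [1, 1, 1, 1, 1, 1],
    ("rec-2-org", "rec-2-dup-0"): [1, 1, 1, 0, 1, 0],
    ("rec-3-org", "rec-7-dup-0"): [1, 0, 0, 0, 1, 0],
    ("rec-4-org", "rec-9-dup-0"): [1, 0, 0, 0, 0, 0],
}

# Sum the agreements of each pair.
scores = {pair: sum(vector) for pair, vector in features.items()}

# Count how often each score occurs, highest score first.
distribution = sorted(Counter(scores.values()).items(), reverse=True)

# Keep the pairs with more than THRESHOLD agreeing attributes.
THRESHOLD = 3
matches = [pair for pair, score in scores.items() if score > THRESHOLD]

print(distribution)
print(matches)
```

Applied to the score distribution shown above, a cut-off of more than 3 keeps 1566 + 1332 + 343 = 3241 of the 77249 candidate pairs.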

[11]: features[features.sum(axis=1) > 3]

[11]: [Output: the rows of features whose comparison results sum to more than 3, with the columns given_name, surname, date_of_birth, suburb, state and address_1; 3241 rows x 6 columns]

3.4 Full code

[12]: import recordlinkage
      from recordlinkage.datasets import load_febrl4

      dfA, dfB = load_febrl4()

      # Indexation step
      indexer = recordlinkage.Index()
      indexer.block('given_name')
      candidate_links = indexer.index(dfA, dfB)

      # Comparison step
      compare_cl = recordlinkage.Compare()

      compare_cl.exact('given_name', 'given_name', label='given_name')
      compare_cl.string('surname', 'surname', method='jarowinkler', threshold=0.85, label='surname')
      compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
      compare_cl.exact('suburb', 'suburb', label='suburb')
      compare_cl.exact('state', 'state', label='state')
      compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

      features = compare_cl.compute(candidate_links, dfA, dfB)

      # Classification step
      matches = features[features.sum(axis=1) > 3]
      print(len(matches))

      3241

CHAPTER 4

Data deduplication

4.1 Introduction

This example shows how to find records in datasets belonging to the same entity. In our case, we try to deduplicate a dataset with records of persons. We will try to link within the dataset based on attributes like first name, surname, sex, date of birth
