Write a note on Record Blocking


written 6.7 years ago by

devikaraniroy • 640

“Blocking” is the process of grouping similar-seeming records into blocks that a machine learning component then explores exhaustively. In many blocking approaches, records are grouped together into blocks by shared properties that are indicators of duplication.
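As a rough illustration of the idea above, the following Python sketch groups records into blocks by shared properties and lists only the pairs that fall inside a common block. The field names (last_name, birth_year, phone), the choice of blocking keys, and the helper names are assumptions made for this example; they are not taken from any particular system.

```python
from collections import defaultdict
from itertools import combinations

def blocking_keys(record):
    """Yield hypothetical blocking keys for one record (illustrative choices only)."""
    if record.get("last_name") and record.get("birth_year"):
        yield ("name_year", record["last_name"].lower(), record["birth_year"])
    if record.get("phone"):
        yield ("phone", record["phone"])

def build_blocks(records):
    """Map each blocking key to the ids of the records that share it."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        for key in blocking_keys(rec):
            blocks[key].append(rec_id)
    return blocks

def candidate_pairs(blocks):
    """Enumerate every pair inside each block; a set removes repeated pairs."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

# Made-up sample records; different records carry different subsets of fields.
records = {
    1: {"last_name": "Roy", "birth_year": 1990, "phone": "555-0100"},
    2: {"last_name": "roy", "birth_year": 1990},
    3: {"last_name": "Shah", "birth_year": 1985, "phone": "555-0100"},
    4: {"last_name": "Mehta", "birth_year": 1972},
}

pairs = candidate_pairs(build_blocks(records))
all_pairs = len(records) * (len(records) - 1) // 2
print(sorted(pairs), "out of", all_pairs, "possible pairs")
```

With four records there are 6 possible pairs, but only the pairs that share a block are passed on for comparison. On very large inputs this reduction is the whole point of blocking.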

However, when dealing with very large data sources, it is nearly impossible to fix, at training time, a set of blocking properties that will remain optimal for the distribution of values actually encountered at run time.

The process starts by collecting billions of personal records. These records might include name, address, birthday, phone number, (encrypted) social security number, job title, and university attended. Note that different records will include different subsets of these example fields.

After collection and categorization, the Record Linkage process should link together all records belonging to the same real-world person.

Our system follows the standard high-level structure of a record linkage pipeline by being divided into four major components:

• data cleaning

• blocking

• pairwise linkage

• clustering

First, all records go through a cleaning process that starts with the removal of bogus, junk and spam records. Then all records are normalized to an approximately common representation.

Finally, all major noise types and inconsistencies are addressed, such as empty/bogus fields, field duplication, outlier values and encoding issues. At this point, all records are ready for the subsequent stages of Record Linkage.
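A minimal sketch of such a cleaning step is given below. The list of junk values, the field name phone, and the specific normalization rules are assumptions chosen for illustration; a real pipeline would need rules tuned to its own data.

```python
import re
import unicodedata

# Hypothetical placeholder values treated as empty/bogus fields.
JUNK_VALUES = {"", "n/a", "none", "null", "unknown", "test"}

def normalize_text(value):
    """Unify Unicode forms, collapse whitespace, and lowercase."""
    value = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", value).strip().lower()

def normalize_phone(value):
    """Keep digits only; return None if nothing is left."""
    digits = re.sub(r"\D", "", value)
    return digits or None

def clean_record(raw):
    """Return a cleaned copy of the record, or None if no usable field remains."""
    cleaned = {}
    for field, value in raw.items():
        if value is None:
            continue
        value = normalize_text(str(value))
        if value in JUNK_VALUES:
            continue  # drop empty/bogus fields
        if field == "phone":
            value = normalize_phone(value)
            if value is None:
                continue
        cleaned[field] = value
    return cleaned or None

print(clean_record({"name": "  Devika   ROY ", "phone": "(555) 010-0100", "job_title": "N/A"}))
print(clean_record({"name": "", "phone": None}))
```

The second call returns None, which is one simple way to mark a record as junk so that it can be removed before blocking.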

The blocking step groups records by shared properties to determine which pairs of records should be examined by the pairwise linker as potential duplicates. Next, the linkage step assigns a score to pairs of records inside each block using a high-precision machine learning model.
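To show where the linkage step sits, here is a small sketch that scores every pair inside one block. Note that the scorer is only a stand-in: it uses the fraction of agreeing shared fields, not a trained machine learning model, and the function names and sample records are assumptions for illustration.

```python
from itertools import combinations

def pair_score(a, b):
    """Stand-in scorer (NOT a trained model): fraction of shared,
    non-empty fields whose values agree exactly."""
    shared = [f for f in a if f in b and a[f] and b[f]]
    if not shared:
        return 0.0
    agree = sum(1 for f in shared if a[f] == b[f])
    return agree / len(shared)

def score_block(block_ids, records):
    """Score each pair of records inside a single block."""
    scores = {}
    for i, j in combinations(sorted(block_ids), 2):
        scores[(i, j)] = pair_score(records[i], records[j])
    return scores

# Made-up, already-cleaned records that landed in the same block.
records = {
    1: {"name": "devika roy", "phone": "5550100", "city": "mumbai"},
    2: {"name": "devika roy", "phone": "5550100"},
    3: {"name": "d. roy", "phone": "5550100", "city": "pune"},
}

scores = score_block([1, 2, 3], records)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Only fields present in both records are compared, since different records carry different subsets of fields. In a real system the stand-in function would be replaced by the learned model.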

If a pair scores above a user-defined threshold, the records are presumed to represent the same person. The clustering step first combines record pairs into connected components and then further partitions each connected component to remove inconsistent pairwise links.
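The first half of the clustering step (thresholding followed by connected components) can be sketched with a simple union-find structure. The scores, the threshold value 0.9, and the function name are made up for this example, and the further partitioning of each component is not implemented here.

```python
def connected_components(record_ids, scored_pairs, threshold):
    """Link pairs scoring above the threshold and return the resulting
    connected components (union-find with path compression)."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the path as we walk up
            x = parent[x]
        return x

    for (a, b), score in scored_pairs.items():
        if score > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    return sorted(groups.values(), key=min)

# Hypothetical pairwise scores produced by the linkage step.
scored = {(1, 2): 0.97, (2, 3): 0.97, (3, 4): 0.40, (4, 5): 0.92}
components = connected_components([1, 2, 3, 4, 5], scored, threshold=0.9)
print(components)
```

Records 1 and 3 end up in the same component even though that pair was never scored directly. Such transitive links are why each connected component may need to be partitioned further before it is accepted as a profile.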

Hence, at the end of the entire record linkage process, the system has partitioned the input records into disjoint sets called profiles, where each profile corresponds to a single person.
