Basics
Record linkage refers to the task of finding entries that refer to the same entity in two or more files. The initial idea goes back to Howard Newcombe, who tried to link two files containing personal data.
In 1969, Fellegi and Sunter formalized these ideas. Their pioneering work "A Theory for Record Linkage" remains the mathematical foundation for many record linkage applications today.
Mathematical Model
In an application with two files, A and B, denote the rows (records) by α(a) in file A and β(b) in file B. Assign K characteristics to each record. The set of record pairs that represent identical entities is defined by
\[ M = \{(a, b) : a = b,\ a \in A,\ b \in B\} \]
and the complement of M, namely the set U of pairs representing different entities, is defined as
\[ U = \{(a, b) : a \neq b,\ a \in A,\ b \in B\}. \]
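The two sets can be illustrated on a toy example. The following is a minimal sketch, not part of the original model description: the files, field names and the `entity` identifier are hypothetical, and the true identity of each record is assumed to be known, which is normally not the case in a real linkage task.

```python
from itertools import product

# Hypothetical toy files. The "entity" key is the true identity of a record;
# it is assumed known here only so that M and U can be written down.
file_a = [
    {"entity": 1, "sex": "f", "age": 34},
    {"entity": 2, "sex": "m", "age": 51},
]
file_b = [
    {"entity": 2, "sex": "m", "age": 51},
    {"entity": 3, "sex": "f", "age": 34},
    {"entity": 1, "sex": "f", "age": 35},
]

# All candidate pairs (a, b) from A x B, identified by row index.
pairs = list(product(range(len(file_a)), range(len(file_b))))

# M: pairs referring to the same entity; U: every remaining pair.
M = {(a, b) for a, b in pairs if file_a[a]["entity"] == file_b[b]["entity"]}
U = set(pairs) - M

print(sorted(M))  # [(0, 2), (1, 0)]
print(len(U))     # 4
```

Every pair in A x B falls into exactly one of the two sets, so M and U together partition the full cross product.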
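A vector γ is defined that contains the coded agreements and disagreements on each characteristic:
\[ \gamma[\alpha(a), \beta(b)] = \{\gamma^{1}[\alpha(a), \beta(b)], \ldots, \gamma^{K}[\alpha(a), \beta(b)]\} \]
where the superscript k = 1, ..., K indexes the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector γ given $(a, b) \in M$ and given $(a, b) \in U$ are defined as
\[ m(\gamma) = P\{\gamma[\alpha(a), \beta(b)] \mid (a, b) \in M\} = \sum_{(a, b) \in M} P\{\gamma[\alpha(a), \beta(b)]\} \cdot P[(a, b) \mid M] \]
and
\[ u(\gamma) = P\{\gamma[\alpha(a), \beta(b)] \mid (a, b) \in U\} = \sum_{(a, b) \in U} P\{\gamma[\alpha(a), \beta(b)]\} \cdot P[(a, b) \mid U], \]
respectively.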
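As an illustration, the comparison vector γ and the two conditional probabilities can be sketched in a few lines. This is a minimal sketch under stated assumptions: the field names, the helper functions `gamma` and `pattern_probabilities`, and the sample records are all hypothetical, agreement is coded as a simple 0/1 equality test, and the match status of each pair is assumed to be known so that the probabilities can be taken as relative frequencies. In practice the match status is usually unknown and the probabilities have to be estimated.

```python
from collections import Counter

# K = 3 hypothetical characteristics compared for every record pair.
FIELDS = ("sex", "age", "marital_status")

def gamma(rec_a, rec_b):
    """Binary comparison vector: 1 where the field values agree, else 0."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in FIELDS)

def pattern_probabilities(labelled_pairs):
    """Relative frequency of each gamma pattern among matches (m) and
    non-matches (u), given pairs labelled with their true match status."""
    counts = {True: Counter(), False: Counter()}
    for rec_a, rec_b, is_match in labelled_pairs:
        counts[is_match][gamma(rec_a, rec_b)] += 1
    m = {g: n / sum(counts[True].values()) for g, n in counts[True].items()}
    u = {g: n / sum(counts[False].values()) for g, n in counts[False].items()}
    return m, u

# Hypothetical records and labelled pairs (True = same entity).
r1 = {"sex": "f", "age": 34, "marital_status": "single"}
r2 = {"sex": "f", "age": 34, "marital_status": "married"}
r3 = {"sex": "m", "age": 51, "marital_status": "married"}
r4 = {"sex": "f", "age": 35, "marital_status": "single"}

labelled = [
    (r1, dict(r1), True),   # exact duplicate
    (r1, r2, True),         # same person, marital status changed
    (r1, r3, False),
    (r1, r4, False),
    (r2, r3, False),
    (r3, r4, False),
]

m, u = pattern_probabilities(labelled)
print(gamma(r1, r4))   # (1, 0, 1)
print(m[(1, 1, 1)])    # 0.5
print(u[(0, 0, 0)])    # 0.5
```

Patterns with many agreements tend to have a high m and a low u, which is what makes the pair of probabilities useful for separating matches from non-matches.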