Evaluating Linkage via Synthetic Data

‘Gold-standard’ linked data for evaluating linkage algorithms is scarce, particularly at realistic scale. Synthetic data has the advantage that all the true links are known. In the domain of demographic population reconstruction, the ability to synthesise populations on demand, with varying characteristics, allows a linkage approach to be evaluated across a wide range of data sets.

We are developing a micro-simulation model that generates such synthetic populations from a set of desired statistical properties. We validate that the generated populations exhibit these properties, then use the populations to evaluate linkage algorithms. In particular, we examine how linkage quality varies across populations with the same characteristics, populations with differing characteristics, and populations with various types of errors in the raw data.
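
Because every true link in a synthetic population is known, linkage quality can be scored directly against that ground truth. The following is a minimal sketch in Python, not the authors' implementation. It assumes that links are unordered pairs of record identifiers and that quality is summarised by precision, recall and F1; the text above does not name these measures. The function names and record identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Link = Tuple[str, str]  # a pair of record identifiers (hypothetical format)


def normalise(links: Iterable[Link]) -> Set[Link]:
    """Treat links as unordered pairs so (a, b) and (b, a) compare equal."""
    return {tuple(sorted(pair)) for pair in links}


def linkage_quality(true_links: Iterable[Link],
                    predicted_links: Iterable[Link]) -> Dict[str, float]:
    """Score predicted links against the known true links.

    Assumed measures (not taken from the source text):
    precision = correct predictions / all predictions
    recall    = correct predictions / all true links
    f1        = harmonic mean of precision and recall
    """
    truth = normalise(true_links)
    predicted = normalise(predicted_links)
    true_positives = len(truth & predicted)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy example: four true birth-to-death links, three predicted links,
    # of which two are correct (one is given in reversed order).
    truth = [("b1", "d1"), ("b2", "d2"), ("b3", "d3"), ("b4", "d4")]
    predicted = [("b1", "d1"), ("d2", "b2"), ("b3", "d9")]
    print(linkage_quality(truth, predicted))
```

Running the same scoring over populations generated with different characteristics, or with different kinds of injected error, would give one way to compare linkage quality across population types.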