17 Oct 08

Estimation of Disclosure Risk in Sample Microdata Using Probabilistic Modelling

Disclosure risk arises when there is a high probability that an intruder can re-identify an individual in released microdata and thereby obtain confidential information. To make informed decisions about the release of microdata, we need objective disclosure risk measures that quantify the risk of re-identification. We assume that the microdata contain individuals investigated in a survey, such as the Labour Force Survey, and that the population is unknown (or only partially known through some marginal distributions).

The disclosure risk is a function of both the population and the sample, and in particular of the cell counts in a contingency table defined by combinations of identifying discrete key variables, such as sex, age and occupation. Based on probabilistic models, we estimate per-record disclosure risk measures: the probability that a record that is unique in its cell of the sample contingency table is also unique in the population, and the probability that a sample unique is correctly matched to a record in the population. Per-record risk measures are used to target high-risk records for data masking techniques, thereby minimizing information loss. Consistent global file-level disclosure risk measures are aggregated from the per-record risk measures. They include the number of sample uniques that are population uniques and the expected number of correct matches to the population. The global risk measures are particularly useful for setting thresholds and for determining whether to release the microdata, depending on the mode of access.
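The two per-record measures and their file-level aggregates can be illustrated with a minimal sketch. It assumes a simple Poisson model that the abstract does not spell out: the population count in cell k is Poisson with mean lam, and each population unit enters the sample independently with probability pi. Under that assumption, the number of unsampled units in a sample-unique cell is Poisson with mean lam * (1 - pi). The function names and the inputs lam and pi are hypothetical and chosen only for illustration.

```python
import math


def per_record_risks(lam, pi):
    """Risk measures for one sample-unique cell (assumed Poisson model).

    lam : assumed mean of the population count in the cell
    pi  : assumed sampling fraction, 0 < pi <= 1
    Returns (P(population unique | sample unique),
             E[1 / population count | sample unique]).
    """
    mu = lam * (1.0 - pi)  # mean number of unsampled units in the cell
    p_unique = math.exp(-mu)  # no unsampled units share the cell
    # E[1 / (1 + X)] for X ~ Poisson(mu); the limit as mu -> 0 is 1
    p_match = 1.0 if mu == 0 else (1.0 - math.exp(-mu)) / mu
    return p_unique, p_match


def global_risks(sample_counts, lams, pi):
    """Sum the per-record measures over all sample-unique cells."""
    tau_unique = 0.0  # expected number of sample uniques that are population uniques
    tau_match = 0.0   # expected number of correct matches for sample uniques
    for f, lam in zip(sample_counts, lams):
        if f == 1:
            p_unique, p_match = per_record_risks(lam, pi)
            tau_unique += p_unique
            tau_match += p_match
    return tau_unique, tau_match
```

For example, global_risks([1, 0, 1, 3], [2.0, 0.5, 4.0, 30.0], 0.5) aggregates over the two sample-unique cells only. In practice lam is unknown and has to be estimated from the sample, which is where the modelling becomes important.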

In this talk, we start with a natural model proposed by Bethlehem, Keller and Pannekoek (1990) for estimating the disclosure risk measures, based on the Poisson-Gamma distribution. We will connect this basic model to current models in the literature and provide a more general framework for probabilistic disclosure risk assessment. To estimate the parameters of the distributions, we rely on log-linear modelling of the sample counts in the contingency table spanned by the key variables. These tables are very large and sparse, so general asymptotic assumptions do not apply. More robust model selection techniques and goodness-of-fit criteria have been developed. The models will be demonstrated on sample data drawn from the UK Census (where the population is known) and on real data sets from the UK Office for National Statistics (ONS), with an emphasis on practical implementation.
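As a rough illustration of how a log-linear model of the sample counts can supply the cell parameters, the sketch below fits the simplest such model, independence between two key variables, for which the fitted counts have a closed form (row total times column total, divided by the sample size). This is an assumption made for brevity: the abstract does not say which log-linear models are selected, and real key-variable tables have many more dimensions. The function names and the sampling fraction pi are hypothetical.

```python
def independence_fit(table):
    """Fitted sample counts under the two-way independence log-linear model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Closed-form maximum likelihood fit: row total * column total / n
    return [[r * c / n for c in col_totals] for r in row_totals]


def estimated_population_means(table, pi):
    """Scale fitted sample means up to population cell means.

    Assumes sample counts are Poisson with mean pi * lam, so lam = fitted / pi.
    """
    fitted = independence_fit(table)
    return [[m / pi for m in row] for row in fitted]
```

For a sample table [[1, 3], [2, 4]] the fitted counts are approximately [[1.2, 2.8], [1.8, 4.2]], and with pi = 0.1 the estimated population means are ten times larger. With large, sparse tables the choice of model matters much more than in this toy case, which is why model selection and goodness-of-fit criteria need care.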
