Preserving Confidentiality and Quality of Tabular Data: Are Safe Data Necessarily
Inferior Data?
Slide 1
Lawrence H. Cox, Associate Director
National Center for Health Statistics
LCOX@CDC.GOV
Bureau of Transportation Statistics Confidentiality Seminar
Washington, DC
September 17, 2003
PRESENTATION HANDOUT – DO NOT QUOTE OR CITE
Slide 2
Statistical Disclosure Limitation (SDL) for Tabular Data
Tabular data
- frequency (count) data organized in contingency tables
- magnitude data (income, sales, tonnage, # employees, ..) organized
in sets of tables
Tables
- there can be many, many, many tables (national censuses)
- tables can be 1-, 2-, 3-, .........up to many dimensions
- tables can be linked
- table entries: cells (industry = retail shoe stores & location
= Washington DC)
- data to be published: cell values (first quarter sales for shoe
stores in Washington DC = $17M)
What is disclosure?
Count data: disclosure = small counts (1, 2, ...)
Magnitude data: disclosure = dominated cell value
Example:
Shoe company #1:           $10M
Shoe company #2:            $6M
Other companies (total):    $1M
Cell value:                $17M
Company #2 can subtract its contribution from the cell value and infer the
contribution of company #1 to within 10% of its true value ($17M - $6M = $11M
vs. $10M) = DISCLOSURE
Cells containing disclosure are called sensitive cells
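The arithmetic of the shoe-store example can be sketched in a few lines of Python. The function name and the 10% threshold are illustrative assumptions, not an official sensitivity rule:

```python
def estimate_error(cell_value, intruder_contribution, target_contribution):
    """Relative error of the intruder's estimate of the target's contribution.

    The intruder subtracts its own contribution from the published cell
    value and treats the remainder as an estimate of the target.
    """
    estimate = cell_value - intruder_contribution
    return (estimate - target_contribution) / target_contribution


# Shoe-store example (values in $M): company #2 estimates company #1.
contributions = {"company_1": 10.0, "company_2": 6.0, "others": 1.0}
cell_value = sum(contributions.values())

error = estimate_error(cell_value, contributions["company_2"], contributions["company_1"])
is_sensitive = error <= 0.10  # hypothetical rule: an estimate within 10% discloses
print(f"estimate error = {error:.0%}, sensitive = {is_sensitive}")
```

Calling the same function with a larger published cell value shows how moving the value away from the truth widens the intruder's error.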
How is disclosure in tabular data limited
by statistical agencies?
- identify cell values representing disclosure
- determine safe values for these cells
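Example: If estimates no closer than 20% to any contribution are deemed safe (a
policy decision), then a safe value for the cell above would be $18M, since #2
would then estimate #1 at $18M - $6M = $12M, 20% above the true $10M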
- traditional methods for statistical disclosure limitation
- Count data:
- rounding
- data perturbation
- swapping/switching
- cell suppression
What is cell suppression?
- replace each disclosure-cell value by a symbol (variable)
- replace selected other cell values by a symbol (variable) to prevent
narrow estimates of disclosure-cell values
- process is complete when resulting system of equations divulges no unsafe
estimates of disclosure-cell values
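Why additional (complementary) cells must be suppressed can be seen on a single row with a published total. The sketch below uses a hypothetical three-cell row and considers only one equation; real tables also have column equations, which is what makes the full problem hard:

```python
def recover_single_suppression(row, total):
    """Fill in a lone suppressed cell (None) by subtraction from the total.

    If more than one cell is suppressed, the row equation alone only
    fixes the sum of the suppressed cells, so the row is returned unchanged.
    """
    missing = [i for i, v in enumerate(row) if v is None]
    if len(missing) != 1:
        return list(row)
    known = sum(v for v in row if v is not None)
    filled = list(row)
    filled[missing[0]] = total - known
    return filled


# Hypothetical row with published total 100.
primary_only = [None, 30, 50]        # only the sensitive cell is suppressed
with_complement = [None, None, 50]   # one complementary suppression added

print(recover_single_suppression(primary_only, 100))     # sensitive value is exposed
print(recover_single_suppression(with_complement, 100))  # only the sum of the two is known
```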
Some properties of cell suppression:
- based on mathematical programming
- very complex theoretically, computationally, practically
- destroys useful information
- thwarts many analyses; favors sophisticated users
How does cell suppression address data quality?
Cell suppression employs a linear objective function to control oversuppression
Namely, the mathematical program is instructed to minimize:
- total value suppressed
- total percent value suppressed
- number of cells suppressed
- logarithmic function related to cell values
- etc.
These are overall (global) measures of data distortion
Further, individual cell costs or capacities can be set to control
individual (local) distortion
These are all sensible criteria and worth doing
However, they do not preserve statistical properties (moments)
Moreover, suppression destroys data and thwarts analysis
Slide 3
Controlled Tabular Adjustment (CTA)
- new method for SDL in tabular data
- perturbative method–changes, does not eliminate, data
- alternative to complementary cell suppression
- attractive for magnitude data & applicable to count data
Original CTA Method (Dandekar and Cox 2002)
- identify sensitive tabulation cells
- replace each disclosure cell by a safe value–namely, move the cell
value down or up until safety is reached
- use linear programming to adjust nonsensitive values in order to restore
additivity (rebalancing)
- if second and third steps are performed simultaneously, a mixed integer
linear program (MILP) results. MILP is extremely computationally demanding
- otherwise (most often), the down/up decision is made heuristically, followed
by rebalancing via linear programming (LP). LP computes efficiently even for
large problems
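The adjust-then-rebalance idea can be illustrated without an LP solver. The sketch below is a stand-in, not the Dandekar-Cox method: instead of linear programming it compensates along a rectangle of four cells, which is the simplest change that leaves every row and column total intact. The table, the protection limit, and the choice of partner cell are hypothetical:

```python
def adjust_and_rebalance(table, cell, delta, partner):
    """Move one sensitive cell by `delta` and restore additivity.

    The sensitive cell (r, c) and the diagonal cell (r2, c2) change by
    +delta, while (r, c2) and (r2, c) change by -delta, so every row and
    column total is unchanged.
    """
    (r, c), (r2, c2) = cell, partner
    if r == r2 or c == c2:
        raise ValueError("partner must lie in a different row and column")
    adjusted = [row[:] for row in table]
    adjusted[r][c] += delta
    adjusted[r2][c2] += delta
    adjusted[r][c2] -= delta
    adjusted[r2][c] -= delta
    return adjusted


def margins(table):
    """Row totals and column totals of a 2-D table of interior cells."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]


# Hypothetical 3x3 interior table; cell (0, 0) is sensitive with protection limit 20.
original = [[70, 500, 430],
            [300, 900, 800],
            [630, 600, 770]]
adjusted = adjust_and_rebalance(original, cell=(0, 0), delta=20, partner=(2, 1))

assert margins(adjusted) == margins(original)  # additivity is restored
print(adjusted)
```

An LP would instead spread the compensation over many nonsensitive cells so that each individual change stays small.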
Slide 4
(Nearly) Actual Example of Magnitude Table with Disclosures
     167 |      317 |     1284 |      587 |     4490 |     3981 |     2442 |     1150 |  70 (21) |    14488
  57 (1) |     1487 |      172 |      667 |     1006 |      327 |     1683 |     1138 |   46 (7) |     6583
     616 |      202 |     1899 |     1098 |     2172 |     3825 |     4372 | 300 (40) |      787 |    15271
       0 |  36 (10) |        0 |   16 (4) |        0 |        0 |       65 |        0 | 140 (40) |      257
     840 |     2042 |     3355 |     2368 |     7668 |     8133 |     8562 |     2588 |     1043 |    36599
Example 1: 4x9 Table of Magnitude Data & Protection Limits (in parentheses) for the 7 Disclosure Cells
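Example 1 is additive: each row and column of interior cells sums to its published total. A minimal check, assuming the layout used in this handout (last column holds row totals, last row holds column totals), might look like this:

```python
def is_additive(table):
    """Check a 2-D table whose last column holds row totals and whose
    last row holds column totals (including the grand total)."""
    *rows, totals = table
    rows_ok = all(sum(row[:-1]) == row[-1] for row in rows)
    cols_ok = all(sum(col) == total for col, total in zip(zip(*rows), totals))
    return rows_ok and cols_ok


# Example 1, with the protection limits omitted.
example_1 = [
    [167, 317, 1284, 587, 4490, 3981, 2442, 1150, 70, 14488],
    [57, 1487, 172, 667, 1006, 327, 1683, 1138, 46, 6583],
    [616, 202, 1899, 1098, 2172, 3825, 4372, 300, 787, 15271],
    [0, 36, 0, 16, 0, 0, 65, 0, 140, 257],
    [840, 2042, 3355, 2368, 7668, 8133, 8562, 2588, 1043, 36599],
]
print(is_additive(example_1))
```

The same check can be applied to any adjusted version of the table to confirm that rebalancing restored additivity.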
    D |   317 |  1284 |     D |  4490 |  3981 |  2442 |  1150 |     D | 14488
    D |  1487 |   172 |   667 |  1006 |   327 |  1683 |     D |     D |  6583
  616 |     D |  1899 |  1098 |  2172 |  3825 |  4372 |     D |   787 | 15271
    0 |     D |     0 |     D |     0 |     0 |    65 |     0 |     D |   257
  840 |  2042 |  3355 |  2368 |  7668 |  8133 |  8562 |  2588 |  1043 | 36599
Example 1a: After Optimal Suppression: 11 Cells (30%) & 2759 Units (7.5%) Suppressed (D = suppressed cell)
  167 |   317 |  1276 |   587 |  4490 |  3981 |  2442 |  1150 |    91 | 14501
   56 |  1487 |   172 |   667 |  1006 |   327 |  1679 |  1138 |    39 |  6571
  617 |   196 |  1899 |  1095 |  2172 |  3825 |  4371 |   260 |   797 | 15232
    0 |    26 |     0 |    12 |     0 |     0 |    70 |     0 |   180 |   288
  840 |  2026 |  3347 |  2361 |  7668 |  8133 |  8562 |  2548 |  1107 | 36592
Example 1b: Table After Controlled Tabular Adjustment
  167 |   317 |  1276 |   587 |  4490 |  3981 |  2442 |  1150 |    91 | 14501
   56 |  1487 |   172 |   667 |  1006 |   327 |  1683 |  1138 |    35 |  6571
  617 |   202 |  1899 |  1098 |  2172 |  3825 |  4372 |   260 |   787 | 15232
    0 |    20 |     0 |     9 |     0 |     0 |    65 |     0 |   194 |   288
  840 |  2026 |  3347 |  2361 |  7668 |  8133 |  8562 |  2548 |  1107 | 36592
Example 1c: Table After Optimal Controlled Tabular Adjustment (Regression)
Slide 5
MILP for Controlled Tabular Adjustment (Cox 2000)
Original data: n x 1 vector a
Adjusted data: n x 1 vector $a + y^+ - y^-$
T denotes the coefficient matrix for the tabulation equations
Denote $y = y^+ - y^-$
Cells i = 1, ..., s are the sensitive cells
Upper (lower) protection for sensitive cell i is denoted $p_i$ ($-p_i$)
MILP for the case of minimizing the sum of absolute adjustments:
$\min \sum_{i=1}^{n} (y_i^+ + y_i^-)$
Subject to:
$T(y) = 0$
$y_i^- = p_i (1 - I_i)$, i = 1, ..., s (sensitive cells)
$y_i^+ = p_i I_i$, i = 1, ..., s (sensitive cells)
$0 \le y_i^-, y_i^+ \le e_i$, i = s+1, ..., n (nonsensitive cells)
$I_i$ binary, i = 1, ..., s
Capacities $e_i$ on adjustments to nonsensitive cells are typically small, e.g.,
based on measurement error
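For intuition about the binary down/up decisions, the sketch below brute-forces them for a deliberately simplified case: a table with a single tabulation equation (one fixed total), so that rebalancing only requires the net adjustment to be zero. This is an assumption made for illustration; a real table has many equations and needs a MILP or LP solver. The protection limits are those of Example 1, and the capacities are hypothetical:

```python
from itertools import product


def best_directions(protection, capacities):
    """Enumerate down/up choices for a table with ONE equation (fixed total).

    protection: p_i for each sensitive cell (it moves by exactly +p_i or -p_i)
    capacities: e_j for each nonsensitive cell (it may move by at most e_j)
    Returns (cost, directions) minimising the sum of absolute adjustments,
    or None when no choice can be rebalanced within the capacities.
    """
    slack = sum(capacities)
    best = None
    for directions in product((-1, 1), repeat=len(protection)):
        net = sum(d * p for d, p in zip(directions, protection))
        if abs(net) > slack:
            continue  # nonsensitive cells cannot absorb the net change
        # Sensitive cells always cost sum(p); rebalancing costs exactly |net|.
        cost = sum(protection) + abs(net)
        if best is None or cost < best[0]:
            best = (cost, directions)
    return best


# Protection limits of the seven disclosure cells in Example 1;
# the three capacities are hypothetical.
p = [21, 1, 7, 40, 10, 4, 40]
e = [5, 5, 5]
print(best_directions(p, e))
```

Enumeration grows as 2^s in the number of sensitive cells, which is one way to see why the full MILP is computationally demanding.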
Slide 6
Data Quality Issues
Because it is based on mathematical programming, just like cell suppression, CTA can minimize:
- total value suppressed
- total percent value suppressed
- number of cells suppressed
- logarithmic function related to cell values
- etc.
In addition, adjustments to nonsensitive cells can be restricted to lie within
measurement error
Still, this may not ensure good statistical outcomes, namely, that analyses on
original vs. adjusted data yield comparable results
Slide 7
Towards Ensuring Comparable Statistical Analyses
Verification of “comparable results” is mostly empirical
Many, many analyses are possible: which analysis to choose?
Instead, we focus on preserving key statistics and linear models
- mean values
- variance
- correlation
- regression slope
between original and adjusted data
Can do this using direct (Tabu) search
I will describe how to do so well in most cases using LP
For simplicity, assume that the down/up decisions for sensitive cells have
already been made (by heuristic)
Slide 8
Preserving Mean Values
When the LP holds a total fixed, it preserves the mean of the cell values
contributing to the total; e.g., fixing the grand total preserves the overall
mean
In general, to preserve a mean, introduce a (new) constraint: Σ (adjustments
to cells contributing to the mean) = 0
A criticism of CTA is that it introduces too much distortion into the values
of the sensitive cells
In general the intruder does not necessarily know which cells are sensitive
nor cares to analyze only sensitive data, so focusing on distortions to sensitive
values may be a bit of a red herring
Still, it is useful to demonstrate how to preserve the mean of the sensitive
cell values, as the method applies to preserving the mean of any subset of cells
Preserving the mean of the sensitive cell values is equivalent to constraining
the net adjustment to the sensitive cells to zero:
$\sum_{i=1}^{s} (y_i^+ - y_i^-) = 0$
If, as in the original Dandekar-Cox implementation, we allow only two choices
for yi , this is unlikely to be feasible
However, satisfying this constraint is not a problem if we simply expand the
set of possible y-values viz., if we permit slightly larger down/up adjustments
The MILP is:
$\min c(y)$
Subject to:
$T(y) = 0$
$p_i (1 - I_i) \le y_i^- \le q_i (1 - I_i)$, i = 1, ..., s
$p_i I_i \le y_i^+ \le q_i I_i$, i = 1, ..., s
$0 \le y_i^-, y_i^+ \le e_i$, i = s+1, ..., n
$q_i$ are appropriate upper bounds on changes to sensitive cells
c(y) is a linear cost function, typically involving the sum of absolute
adjustments
If the down/up directions are pre-selected, this is an LP
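With the directions pre-selected, widening the allowed range from exactly p_i to the interval [p_i, q_i] is what makes a zero net adjustment reachable. The greedy sketch below illustrates only that point for the sensitive cells; it is not the LP itself, and it ignores the tabulation equations. The protection limits and directions follow Example 1 and Example 1b, while the upper bounds q_i = 2 p_i are a hypothetical choice:

```python
def balance_sensitive(directions, p, q):
    """Pick magnitudes m_i in [p_i, q_i] so the net sensitive adjustment is zero.

    directions: +1 (up) or -1 (down) per sensitive cell, already chosen
    Returns signed adjustments y_i = directions[i] * m_i, or None if the
    bounds do not allow the up and down totals to match.
    """
    m = list(p)  # start every cell at its minimum safe move
    up = [i for i, d in enumerate(directions) if d > 0]
    down = [i for i, d in enumerate(directions) if d < 0]
    gap = sum(m[i] for i in up) - sum(m[i] for i in down)
    # Enlarge moves on the lighter side until the two totals agree.
    lighter = down if gap > 0 else up
    gap = abs(gap)
    for i in lighter:
        extra = min(q[i] - m[i], gap)
        m[i] += extra
        gap -= extra
    if gap > 0:
        return None
    return [d * mi for d, mi in zip(directions, m)]


p = [21, 1, 7, 40, 10, 4, 40]            # protection limits from Example 1
q = [2 * pi for pi in p]                 # hypothetical upper bounds
directions = [1, -1, -1, -1, -1, -1, 1]  # down/up pattern seen in Example 1b
y = balance_sensitive(directions, p, q)
print(y, sum(y))
```

For comparison, the sensitive-cell changes between Example 1 and Example 1c (+21, -1, -11, -40, -16, -7, +54) also sum to zero.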
Slide 9
Preserving Variances
Seek: $\mathrm{Var}(a + y) \approx \mathrm{Var}(a)$, noting that
$\mathrm{Var}(a + y) = \mathrm{Var}(a) + 2\,\mathrm{Cov}(a, y) + \mathrm{Var}(y)$
Define $L(y) = \mathrm{Cov}(a, y) / \mathrm{Var}(a)$
L(y) is a linear function of the adjustments y
$\mathrm{Var}(a + y) / \mathrm{Var}(a) = 2L(y) + (1 + \mathrm{Var}(y)/\mathrm{Var}(a))$
$|\mathrm{Var}(a + y)/\mathrm{Var}(a) - 1| = |2L(y) + \mathrm{Var}(y)/\mathrm{Var}(a)|$
Var(y) is nonlinear, but can be linearly approximated
Alternatively: typically Var(y)/Var(a) is small
Thus, variance is approximately preserved by minimizing |L(y)|
The absolute value is minimized as follows:
* incorporate two new linear constraints in the system:
$w \ge L(y)$
$w \ge -L(y)$
* minimize w
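The variance decomposition can be checked numerically. The sketch below uses population (divide-by-n) moments, which is an assumption; the identity holds equally for sample moments. The data are the seven sensitive cells of Example 1 and their adjustments in Example 1b:

```python
def mean(x):
    return sum(x) / len(x)


def cov(x, z):
    """Population covariance of two equal-length sequences."""
    mx, mz = mean(x), mean(z)
    return sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / len(x)


def L(a, y):
    """L(y) = Cov(a, y) / Var(a); linear in the adjustments y."""
    return cov(a, y) / cov(a, a)


# Sensitive cells of Example 1 and their adjustments in Example 1b.
a = [70, 57, 46, 300, 36, 16, 140]
y = [21, -1, -7, -40, -10, -4, 40]
adjusted = [ai + yi for ai, yi in zip(a, y)]

ratio = cov(adjusted, adjusted) / cov(a, a)              # Var(a + y) / Var(a)
decomposed = 1 + 2 * L(a, y) + cov(y, y) / cov(a, a)     # 1 + 2L(y) + Var(y)/Var(a)
print(ratio, decomposed)
```

Because L is linear in y, it can be placed directly in an LP objective or constraint, whereas Var(y) cannot.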
Slide 10
Assuring High Positive Correlation
Seek: $\mathrm{Corr}(a, a + y) \approx 1$
$\mathrm{Corr}(a, a + y) = \mathrm{Cov}(a, a + y) / \sqrt{\mathrm{Var}(a)\,\mathrm{Var}(a + y)}$
After some algebra,
$\mathrm{Corr}(a, a + y) = (1 + L(y)) / \sqrt{\mathrm{Var}(a + y)/\mathrm{Var}(a)}$
Again: min |L(y)| yields a good approximation because it drives both numerator
and denominator to one
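The "some algebra" can be verified numerically: the correlation computed directly should agree with the form written in terms of L(y). A minimal sketch using population moments (an assumption) and the sensitive cells of Example 1 with their adjustments in Example 1c:

```python
from math import sqrt


def mean(x):
    return sum(x) / len(x)


def cov(x, z):
    """Population covariance of two equal-length sequences."""
    mx, mz = mean(x), mean(z)
    return sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / len(x)


# Sensitive cells of Example 1 and their adjustments in Example 1c.
a = [70, 57, 46, 300, 36, 16, 140]
y = [21, -1, -11, -40, -16, -7, 54]
adjusted = [ai + yi for ai, yi in zip(a, y)]

var_a = cov(a, a)
var_adj = cov(adjusted, adjusted)
L_y = cov(a, y) / var_a

corr_direct = cov(a, adjusted) / sqrt(var_a * var_adj)
corr_via_L = (1 + L_y) / sqrt(var_adj / var_a)
print(corr_direct, corr_via_L)
```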
Slide 11
Assuring Slope of Regression Line(s)
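Seek: under ordinary least squares regression
$Y = \beta_1 X + \beta_0$
of adjusted data Y = a + y on original data X = a,
we want: $\beta_1 \approx 1$ and $\beta_0 \approx 0$
As $\beta_0 = (\bar{a} + \bar{y}) - \beta_1 \bar{a}$ and the mean adjustment $\bar{y}$ is zero when the mean is preserved, $\beta_0 \approx 0$ if $\beta_1 \approx 1$
Since $\beta_1 = \mathrm{Cov}(a, a + y)/\mathrm{Var}(a) = 1 + L(y)$, this corresponds to $L(y) \approx 0$ (if feasible)
Note again: this is achieved via min |L(y)|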
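The link between the regression line and L(y) follows from the standard least squares formulas: the slope of a + y on a is Cov(a, a + y)/Var(a) = 1 + L(y), and the intercept is the mean of a + y minus the slope times the mean of a. A minimal sketch that checks both, using the sensitive cells of Example 1 and their adjustments in Example 1c (the helper names are illustrative):

```python
def mean(x):
    return sum(x) / len(x)


def cov(x, z):
    """Population covariance of two equal-length sequences."""
    mx, mz = mean(x), mean(z)
    return sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / len(x)


def ols(x, z):
    """Slope and intercept of the least squares line z = slope * x + intercept."""
    slope = cov(x, z) / cov(x, x)
    return slope, mean(z) - slope * mean(x)


# Sensitive cells of Example 1 and their adjustments in Example 1c.
a = [70, 57, 46, 300, 36, 16, 140]
y = [21, -1, -11, -40, -16, -7, 54]
adjusted = [ai + yi for ai, yi in zip(a, y)]

slope, intercept = ols(a, adjusted)
L_y = cov(a, y) / cov(a, a)
print(slope, 1 + L_y)                            # slope equals 1 + L(y)
print(intercept, mean(y) - L_y * mean(a))        # intercept in terms of L(y)
```

When the mean adjustment is zero, the intercept reduces to -L(y) times the mean of a, so driving L(y) toward zero fixes slope and intercept together.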
Slide 12
The Compromise Solution
Variance is preserved by minimizing |L(y)|
Correlation is preserved by minimizing |L(y)|
Regression slope is preserved by L(y) ≈ 0 (if feasible)
All subject to
If Var(y)/Var(a) is small (typical case), imposing the objective
function min |L(y)| assures good results simultaneously
- for variance
- for correlation
- for regression slope
A shortcut is to incorporate the constraint
L(y) = 0 (if feasible)
Choosing L(y) ≈ 0 is motivated statistically because it
implies (near) zero correlation between the values a and the adjustments y;
viz., as solutions y and -y are interchangeable, this correlation
should be zero
Slide 13
Examples
4x9 Table
Magnitude data:
  167500 |   317501 |  1283751 |   587501 |  4490751 |  3981001 |  2442001 |  1150000 |    70000 | 14490006
   56250 |  1487000 |   172500 |   667503 |  1006253 |   327500 |  1683000 |  1138250 |    46000 |  6584256
  616752 |   202750 |  1899502 |  1098751 |  2172251 |  3825251 |  4372753 |   300000 |   787500 | 15275510
       0 |    35000 |        0 |    16250 |        0 |        0 |    65000 |        0 |   140000 |   256250
  840502 |  2042251 |  3355753 |  2370005 |  7669255 |  8133752 |  8562754 |  2588250 |  1043500 | 36606022

Protection limits:
       0 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |    21000
     625 |        0 |        0 |        0 |        0 |        0 |        0 |        0 |     7800
       0 |        0 |        0 |        0 |        0 |        0 |        0 |    40000 |        0
       0 |    10500 |        0 |     4875 |        0 |        0 |        0 |        0 |    42000

Table 1: 4x9 Table of Magnitude Data and Protection Limits for Its Seven
Sensitive Cells (the cells with nonzero protection limits)
  166875 |   307001 |  1283751 |   587501 |  4490751 |  3981001 |  2442001 |  1150000 |    91000 | 14499881
   56875 |  1487000 |   172500 |   667503 |  1006253 |   327500 |  1683000 |  1141875 |    38200 |  6580706
  616752 |   202750 |  1899502 |  1103626 |  2172251 |  3825251 |  4372753 |   260000 |   816300 | 15269185
       0 |    45500 |        0 |    11375 |        0 |        0 |    65000 |    36375 |    98000 |   256250
  840502 |  2042251 |  3355753 |  2370005 |  7669255 |  8133752 |  8562754 |  2588250 |  1043500 | 36606022

  167500 |   317501 |  1283751 |   587501 |  4490751 |  3981001 |  2442001 |  1150000 |    91003 | 14511009
   55625 |  1487000 |   172500 |   667503 |  1006253 |   327500 |  1683000 |  1146675 |    38200 |  6584256
  616752 |   202750 |  1899502 |  1098751 |  2172251 |  3825251 |  4372753 |   260000 |   787498 | 15235508
       0 |    18791 |        0 |     8125 |        0 |        0 |    65000 |        0 |   191756 |   283672
  839877 |  2026042 |  3355753 |  2361880 |  7669255 |  8133752 |  8562754 |  2556675 |  1108457 | 36614445

  167500 |   317501 |  1283751 |   587501 |  4490751 |  3981001 |  2442001 |  1129000 |    91000 | 14490006
   55313 |  1499637 |   172500 |   667503 |  1006253 |   327500 |  1683000 |  1138250 |    34300 |  6584256
  616752 |   202750 |  1899502 |  1098751 |  2172251 |  3825251 |  4372753 |   359884 |   787500 | 15335394
     937 |    19250 |        0 |     8938 |        0 |        0 |    65000 |        0 |    94815 |   188940
  840502 |  2039138 |  3355753 |  2362693 |  7669255 |  8133752 |  8562754 |  2627134 |  1007615 | 36598596

  167500 |   317501 |  1276439 |   587501 |  4490751 |  3981001 |  2442001 |  1150000 |    91000 | 14503694
   55625 |  1487000 |   172500 |   667503 |  1006253 |   327500 |  1683000 |  1138250 |    34420 |  6572051
  616752 |   202750 |  1899502 |  1106063 |  2172251 |  3825251 |  4372753 |   260000 |   787500 | 15242822
       0 |    19250 |        0 |     8938 |        0 |        0 |    65000 |        0 |   194267 |   287455
  839877 |  2026501 |  3348441 |  2370005 |  7669255 |  8133752 |  8562754 |  2548250 |  1107187 | 36606022

Table 2: Original Table After Various Controlled Tabular Adjustments Using
Linear Programming to Preserve Statistical Properties of Sensitive Cells
Only
Slide 14
Results for 4x9 Table
Summary: 4x9 Table Linear Programming
min Σ|yi|               0.98    0.82    0.70
min |L-Bound| (Var.)    0.95    0.93    0.94
max L (Cor.)            0.97    1.20    1.52
min |L| (Reg.)*         0.95    0.93    0.95
All 4 Functions         1.00    1.00    1.00
Table 3: Summary of Results of Numeric Simulations on 4x9 Table Using
Linear Programming
* = compromise solution
Slide 15
Results for 13x13x13 (Dandekar) Table
Summary: 13x13x13 Table Linear Programming
min Σ|yi|               0.995   0.96    0.94
min |L-Bound| (Var.)    0.995   1.00    1.00
max L (Cor.)            0.995   1.00    1.21
min |L| (Reg.)*         0.995   1.00    1.01
All 4 Functions         1.00    1.00    1.00
Table 4: Summary of Results of Numeric Simulations on 13x13x13 Table
Using Linear Programming
* = compromise solution
Slide 16
Concluding Comments
- statistical agencies have responsibilities
- to respondents (to maintain confidentiality)
- to data users (to deliver high-quality data products)
- these responsibilities
- are often in opposition
- nevertheless, are not mutually exclusive
- have, in the past, been approached separately
- research indicates these responsibilities can be addressed
- simultaneously
- using systematic, computationally efficient methods