The 21st Century Challenge of Privacy and Confidentiality for ...

Differential Privacy: Some Basic Concepts
Simson L. Garfinkel
Senior Scientist, Confidentiality and Data Access
U.S. Census Bureau
May 1, 2019
Formal Methods for Statistical Software (FMfSS)
The views in this presentation are those of the author, and not those of the U.S. Census Bureau.

Acknowledgments
This presentation is based on numerous discussions with:
  • John Abowd (Chief Scientist)
  • Dan Kifer (Scientific Lead)
  • Salil Vadhan (Harvard University)
  • Robert Ashmead, Philip Leclerc, William Sexton, Pavel Zhuravlev (US Census Bureau)

Statistical agencies collect data under a pledge of confidentiality.
We pledge:
  • Collected data will be used only for statistical purposes.
  • Collected data will be kept confidential.
  • Data from individuals or establishments won't be identifiable in any publication.
Fines and prison await any Census Bureau employee who violates this pledge.

Statistical agencies are trusted curators.
Respondents -> Confidential Database -> Published Statistics

Confidential database (record codes read sex, race, marital status; for example, 24 FWS is a 24-year-old Female White Single):

    Age   Sex   Race/MS
      8   F     BS
     18   M     WS
     24   F     WS
     30   M     WM
     36   F     BM
     66   F     BM
     84   M     BM

Published statistics:

                #   Median Age   Mean Age
    Total       7       30         38
    Women       4       30         33.5
    Male        3       30         44
    Black       4       51         48.5
    White       3       24         24
    Married     4       51         54
    Black F     3       36         36.7

We now know the trusted curator model is more complex.
Respondents -> Confidential Database -> Published Statistics
  • Every data publication results in some privacy loss.
  • Publishing too many statistics results in the compromise of the entire confidential database.

Consider the statistics from a single household.
Household: 24 yrs Female White Single (24 FWS)

                Count   Median   Mean
    Total         1       24      24
    # Female      1       24      24
    # white       1       24      24
    Single        1       24      24
    White F       1       24      24

Publishing statistics for this household alone would result in an improper disclosure.
Household: 24 yrs Female White Single (24 FWS)

                Count   Median   Mean
    Total        (D)     (D)     (D)
    # Female     (D)     (D)     (D)
    # white      (D)     (D)     (D)
    Single       (D)     (D)     (D)
    White F      (D)     (D)     (D)

(D) means suppressed to prevent an improper disclosure.

In the past, statistical agencies aggregated data from many households together into a single publication.

                Count   Median Age   Mean Age
    Total         7        30          38
    # Female      4        30          33.5
    # male        3        30          44
    # black       4        51          48.5
    # white       3        24          24
    Married       4        51          54
    Black F       3        36          36.7

We now know that this publication can be reverse-engineered to reveal the confidential database.
Reconstructed households: 66 FBM & 84 MBM; 30 MWM & 36 FBM; 8 FBS; 18 MWS; 24 FWS

                Count   Median   Mean
    Total         7       30      38
    # Female      4       30      33.5
    # male        3       30      44
    # black       4       51      48.5
    # white       3       24      24
    Married       4       51      54
    Black F       3       36      36.7

This table can be expressed by 164 equations. Solving those equations takes 0.2 seconds on a 2013 MacBook Pro.

Faced with database reconstruction, statistical agencies have just two choices.
  • Option #1: Publish fewer statistics.
  • Option #2: Publish statistics with less accuracy.

The problem with publishing fewer statistics: it's hard to know how many statistics is too many.
With some rows suppressed (XXXX), at least two different databases fit the published table:

                Count   Median   Mean
    Total         7       30      38
    # Female      4       30      33.5
    # male        3       30      44
    # black       4       51      48.5
    # white       3       24      24
    Single       XXXX    XXXX    XXXX
    Married       4       51      54
    Black F       3       36      36.7
    Black M      XXXX    XXXX    XXXX
    White M      XXXX    XXXX    XXXX
    White F      XXXX    XXXX    XXXX

    Solution #1   Solution #2
      8 FBS         2 FBS
     18 MWS        12 MWS
     24 FWS        24 FWS
     30 MWM        30 MBM
     36 FBM        36 FWM
     66 FBM        72 FBM
     84 MBM        90 MBM

Faced with database reconstruction, statistical agencies have just one choice.
  • Option #1: Publish fewer statistics. (No longer viable, given the problem described above.)
  • Option #2: Publish statistics with less accuracy.

Differential privacy gives us a mathematical approach for balancing accuracy and privacy loss.
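To make the reconstruction argument above concrete, the following is a minimal Python sketch that brute-forces a single row of the published table. It is not the 164-equation system mentioned in the slides, nor the author's solver. The function names and the 0 to 125 age range are assumptions made for illustration, and the published mean is assumed to be rounded to one decimal place.

```python
from itertools import combinations_with_replacement

MAX_AGE = 125  # assumed upper bound on age; not stated in the slides


def median_of_sorted(ages):
    """Median of an already-sorted tuple of ages."""
    n = len(ages)
    mid = n // 2
    return ages[mid] if n % 2 else (ages[mid - 1] + ages[mid]) / 2


def consistent_age_sets(count, median, mean, max_age=MAX_AGE):
    """Return every sorted tuple of integer ages that matches one published
    row (count, median, mean). The mean is matched to one decimal place."""
    matches = []
    # combinations_with_replacement over an ascending range yields sorted tuples
    for ages in combinations_with_replacement(range(max_age + 1), count):
        if median_of_sorted(ages) != median:
            continue
        if abs(sum(ages) / count - mean) < 0.05:
            matches.append(ages)
    return matches


if __name__ == "__main__":
    # Rows taken from the published table: "# male" and "# white".
    males = consistent_age_sets(count=3, median=30, mean=44)
    whites = consistent_age_sets(count=3, median=24, mean=24)
    print(len(males), "candidate age sets for the male row")
    print(len(whites), "candidate age sets for the white row")
    # The true male ages in the confidential database are 18, 30 and 84.
    print((18, 30, 84) in males)
```

Taken alone, each row still leaves a few dozen candidate age sets. The attack works because every additional published row (sex by race, marital status, and so on) is another constraint on the same people, and intersecting those constraints shrinks the candidates until only the true database, or something very close to it, remains.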

[Figure: Tradeoff between Accuracy and Privacy Loss. Vertical axis: accuracy, 0% to 100%. Horizontal axis: privacy loss (epsilon). Annotations: "No privacy" and "No accuracy".]

Differential privacy is really two things:
  1. A mathematical definition of privacy loss.
  2. Specific mechanisms that allow us to:
     • Add the smallest amount of noise necessary for a given privacy outcome
     • Structure the noise to have minimal impact on the more important statistics

Differential privacy, the big idea: use noise to create uncertainty about private data.
[Diagram: 24 yrs Female White Single (24 FWS) -> NOISE BARRIER -> 35 yrs Female Black Single (35 FBS)]
  • Impact of the noise ≈ impact of a single person
  • Impact of noise on aggregate statistics decreases with larger population.

Understanding the impact of noise
(Statistics based on 10,000 experiments, epsilon=1.0. Each input passes through the noise barrier; the ranges show where the reported median age fell.)

    Input                        50% of runs   95% of runs
    1 person, age 22               9 to 73       0 to 104
    10 people, all age 22         17 to 61       0 to 103
    100 people, all age 22        21 to 22      21 to 22

The noise also impacts the person counts.

                                 50% of runs                 95% of runs
    Input                        Median(age)   # people      Median(age)   # people
    1 person, age 22               9 to 73     -9 to 11        0 to 104    -29 to 30
    10 people, all age 22         17 to 61      0 to 20        0 to 103    -19 to 38
    100 people, all age 22        21 to 22     90 to 110      21 to 22      71 to 129

The 2020 census and differential privacy

The Disclosure Avoidance System allows the Census Bureau to enforce global confidentiality protections.
[Diagram: data flow]
  Confidential data:
    DRF (Decennial Response File) -> CUF (Census Unedited File) -> CEF (Census Edited File)
  Global Confidentiality Protection Process:
    DAS (Disclosure Avoidance System), governed by the privacy-loss budget and accuracy decisions
  Public data:
    MDF (Microdata Detail File) ->
      • Pre-specified tabular summaries: PL94-171, DHC, detailed DHC
      • Special tabulations and post-census research

The Disclosure Avoidance System relies on injecting formally private noise.
Advantages of noise injection with formal privacy:
  • Transparency: the details can be explained to the public
  • Tunable privacy guarantees
  • Privacy guarantees do not depend on external data
  • Protects against accurate database reconstruction
  • Protects every member of the population
Challenges:
  • Entire country must be processed at once for best accuracy
  • Every use of confidential data must be tallied in the privacy-loss budget

There was no off-the-shelf system for applying differential privacy to a national census.
We had to create a new system that:
  • Produced higher-quality statistics at more densely populated geographies
  • Produced consistent tables
We created new differential privacy algorithms and processing systems that:
  • Produce highly accurate statistics for large populations (e.g. states, counties)
  • Create privatized microdata that can be used for any tabulation without additional privacy loss
  • Fit into the decennial census production system
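The earlier point that the impact of noise on aggregate statistics decreases with larger population can be illustrated with a short Python sketch. It uses the Laplace mechanism, a standard differentially private mechanism for counting queries, with sensitivity 1. This is an assumption for illustration: the slides do not say which mechanism produced the experiment figures, and the sketch does not attempt to reproduce those numbers or the algorithms of the Disclosure Avoidance System. All function names are hypothetical.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).
    The difference of two i.i.d. exponentials with mean `scale` is Laplace."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query.
    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


def median_relative_error(true_count, epsilon, trials=10_000, seed=0):
    """Median of |noisy - true| / true over repeated runs."""
    rng = random.Random(seed)
    errors = [
        abs(noisy_count(true_count, epsilon, rng) - true_count) / true_count
        for _ in range(trials)
    ]
    return statistics.median(errors)


if __name__ == "__main__":
    # Same population sizes as the experiments in the slides: 1, 10, 100 people.
    for population in (1, 10, 100):
        error = median_relative_error(population, epsilon=1.0)
        print(f"population={population:>3}  median relative error={error:.3f}")
```

The absolute size of the noise depends only on epsilon, not on how many people are counted, so the relative error falls roughly in proportion to the population. That is the sense in which the noise is comparable to the impact of a single person: it swamps a household of one but barely moves a count of thousands.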

The 2020 DAS produces highly accurate data when blocks are aggregated into tracts.
[Figure: accuracy plotted against epsilon, the privacy-loss parameter, with reference levels at 99% accuracy and 95% accuracy.]

Two public policy choices:
  • What is the correct value of epsilon?
  • Where should the accuracy be allocated?

For more information
  • "Can a set of equations keep U.S. census data private?" By Jeffrey Mervis, Science, Jan. 4, 2019, 2:50 PM
    http://bit.ly/Science2019C1
  • Garfinkel & Abowd, Communications of the ACM, March 2019

More Background on the 2020 Disclosure Avoidance System
  • September 14, 2017, CSAC (overall design)
    https://www2.census.gov/cac/sac/meetings/2017-09/garfinkel-modernizing-disclosure-avoidance.pdf
  • August 2018, KDD'18 (top-down v. block-by-block)
    https://digitalcommons.ilr.cornell.edu/ldi/49/
  • October 2018, WPES (implementation issues)
    https://arxiv.org/abs/1809.02201
  • October 2018, ACM Queue (understanding database reconstruction)
    https://digitalcommons.ilr.cornell.edu/ldi/50/ or https://queue.acm.org/detail.cfm?id=3295691
