Data privacy is a growing concern for organizations. When data is financial, contains customers’ personally identifiable information (PII), or must meet specific compliance and regulatory requirements, protecting it becomes a crucial activity. While breakthroughs in machine learning on big data make it essential to share data with research teams (in-house or outsourced), improper use of such data can have adverse consequences. With GDPR compliance mandates in place, enterprises must follow strict guidelines and best practices to mask not only their customers’ sensitive information but also their business data before it can be misused.
Traditional encryption of high-risk data, even combined with proper access rights, is not foolproof: if the encryption key is compromised, almost every bit of the data can be exposed. Furthermore, corporates hold diverse datasets of huge and rapidly growing volume, and one encryption technique does not fit all. Pseudonymisation, a method advocated in the GDPR, increases data security and privacy. It works well with larger datasets and consists of stripping PII from snippets of data, but it still leaves traces of real data that could potentially be exploited by the outside world. Corporates are therefore working with data security experts to lay the foundation and design the roadmap for securing their data and avoiding potential risk.
Differential privacy
The development of formal data privacy techniques such as Differential Privacy (DP) adds an extra layer of security on top of obfuscated data. DP is a formal mathematical model of privacy and has become one of the state-of-the-art concepts for data privacyi. It is a powerful standard that allows systems to perform analysis on sensitive data while guaranteeing that an individual’s privacy is not violated even if his/her data is used in the analysis.
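In standard notation (following Dwork and Rothiv), a randomized mechanism M is ε-differentially private if, for any two datasets D and D′ that differ in the record of a single individual, and for any set of possible outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

The smaller the privacy parameter ε, the stronger the guarantee, and the more noise the mechanism must add.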
The DP technique adds mathematical noise to the source data in a way that does not significantly change the output of the analysis, so the output reveals nothing about any specific individual’s private informationii. Mathematically, DP not only promises that anyone seeing the result of an analysis will draw essentially the same inference about any individual’s private information whether or not that individual’s data was included in the inputiii, but also guarantees privacy protection against a wide range of privacy attacksiv.
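As a minimal illustration of how such noise is injected, the Python sketch below applies the classic Laplace mechanism to a simple counting query; the dataset and query are hypothetical, and NumPy is assumed for noise generation. Because a counting query has sensitivity 1, Laplace noise with scale 1/ε gives ε-differential privacy: the released count stays close to the true value, but no single customer’s record can be inferred from it.

```python
import numpy as np

def laplace_count(records, predicate, epsilon=1.0):
    """Differentially private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person's
    record changes the true count by at most 1), so Laplace noise with
    scale 1/epsilon yields epsilon-differential privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many customers are over 60 years old?
customers = [{"age": a} for a in (34, 67, 52, 71, 45, 63)]
print(laplace_count(customers, lambda r: r["age"] > 60, epsilon=0.5))
```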
Wipro’s crowdsourcing platform, Topcoderv, uses the power of the crowd to help enterprises replicate their data using DP techniques in a manner that not only breaks the ability to reverse engineer it but also preserves the key relationships across different datasets. With just a small subset of input data, Topcoder’s crowd can rapidly develop machine learning models to create fully privatized, unidentifiable datasets, allowing organizations to share their data beyond their own boundaries while maintaining privacy and mitigating financial, reputational and operational risks.
Current applications of differential privacy
Many technology-driven companies are experimenting with differentially private implementations. Organizations such as Google, Apple, Facebook, Uber and Netflix are using DP techniques to protect their users’ sensitive information, events, locations, etc.
For instance, Google developed the RAPPOR system, which applies differentially private computations to gather aggregate statistics from Chrome browser users. The system allows the Google team to monitor the wide-scale effects of malicious software on the browser settings of its users, while guaranteeing strong privacy to individualsvi.
With the launch of iOS 10, Apple began using local DP to protect the privacy of user activity in a given time period, while still gaining significant insights to improve the intelligence and usability of features such as QuickType suggestions, Emoji suggestions and Health Type Usage (iOS 10.2)vii.
To address the potential risk of data abuse, Facebook is working to employ DP techniques to build a URL database of links that have been shared on the social network by multiple unique users with their privacy settings set to publicviii.
In cooperation with the University of California, Berkeley, the ride-hailing company Uber recently released an open-source DP tool that allows it and other companies to set limits on the statistical queries staff can conduct on traffic patterns and drivers’ revenueix.
Microsoft has been applying DP techniques to mask the locations of individuals in its geolocation databases to maintain the privacy of its usersx.
Practical considerations
While DP is gaining momentum in practical scenarios and is proving to be a ‘gold mine’ for data privatization, questions about the balance between accuracy and privacy still need to be addressed. Every time a user queries the database, the total privacy ‘leakage’ increases; as further queries are made, this leakage adds up and eventually compromises data privacy.
The more information a user asks for, the more noise has to be injected to limit privacy leakage, and the more noise is added, the less useful the data becomes. So there is always a fundamental trade-off between accuracy and privacy, which can eventually compromise the training and performance of ML models.
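A rough sketch of this trade-off, reusing the hypothetical Laplace counting query from above and the basic sequential-composition rule (the ε values of successive queries add up against a fixed budget), is given below; the budget and counts are illustrative assumptions, not figures from any real system.

```python
import numpy as np

def noisy_count(true_count, epsilon):
    # Laplace mechanism for a sensitivity-1 counting query.
    return true_count + np.random.laplace(scale=1.0 / epsilon)

total_budget = 1.0   # overall privacy budget (epsilon) granted to one analyst
true_count = 1000    # hypothetical true answer to a counting query

# Under basic sequential composition, the epsilons of successive queries add
# up, so k queries against the same data can each spend only total_budget / k,
# and a smaller epsilon per query means a larger noise scale per answer.
for k in (1, 5, 20):
    eps_per_query = total_budget / k
    sample_answer = noisy_count(true_count, eps_per_query)
    print(f"{k:2d} queries: epsilon/query = {eps_per_query:.3f}, "
          f"noise scale = {1.0 / eps_per_query:.0f}, "
          f"sample answer = {sample_answer:.1f}")
```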
While many DP techniques attempt to maximize the accuracy of a computation under a fixed privacy constraint, some product requirements instead fix accuracy as the constraint and allow privacy to be relaxed. Organizations will need to design an appropriately balanced differentially private mechanism for each use case to attain a good privacy-utility trade-off.
i https://journalprivacyconfidentiality.org/index.php/jpc/article/view/405
ii https://dminc.com/blog/all-you-need-to-know-about-differential-privacy/
iii Kobbi Nissim, et al. Differential Privacy: A Primer for a Non-technical Audience. February 14, 2018.
iv Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
vi http://googlepolicyeurope.blogspot.com/2015/11/tackling-urban-mobility-with-technology.html, http://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
vii https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
ix https://iapp.org/news/a/uber-becomes-the-latest-company-to-embrace-differential-privacy/
x https://demystifymachinelearning.wordpress.com/2018/11/20/intro-to-differential-privacy/
Saurabh Aggarwal
Managing Consultant - Data Analytics and AI Consulting practice, Wipro
Saurabh has over 12 years of experience in leading data science and AI projects across industries. He holds a PDEng (Professional Doctorate in Engineering) from TU/e, Netherlands, and a Master of Technology from IIT Kanpur, India.
Saurabh can be reached at saurabh.aggarwal2@wipro.com