Privacy Ground Rules for Big Data
Corporate big data initiatives are collecting massive volumes of personal data, largely for marketing purposes. There is a great deal of confusion about how these practices fit with privacy regulations. Are there any clear ground rules for big data privacy?
The metadata problem
We live our daily lives around a set of passwords: for our phones, bank accounts, computers, and email, at work and at school. Passwords are one of the most common means of protecting privacy. They are the keys that secure our personal information and property.
There is one type of personal data we cannot protect with passwords: the traces we leave every time we use the internet and phone networks. In effect, we can keep our personal information private, but not our metadata, which includes network connections, internet searches, website navigation paths, and personal contacts. For example, we can erase browser cookies, but somewhere there is a log of our browsing patterns or search engine keywords. Facebook tracks the sites we visit, even if we don’t have a Facebook account. The web simply isn’t designed or configured to give users control over their metadata.
This doesn’t mean that companies are free to use metadata however they like. People expect that their privacy will be respected. Think of a small town where no one locks their doors: people still expect that their neighbours won’t just walk into their house and look through their things, and trespassing is still a crime. Today, most physical and data assets are designed to be secured with passwords, locks, and security systems. It is easy to forget that privacy and property laws apply equally to personal property that is not secured.
Privacy “ground rules” for big data
We have said previously that big data is really consumer data: records of our interactions with websites, stores, companies, public institutions, social media platforms and so on. The definition of big data is still loose. “Big data” is used to describe both unstructured and structured data collected through various channels: online, through wifi and phone networks, and in the physical world through sensors or cameras. Most often, the data collected combines personal content and metadata. Companies are effectively gathering massive amounts of personal data while the ground rules for respecting privacy are still undefined. “Data lakes” – massive repositories of relatively unstructured data collected from one or several sources, often without a specific purpose in mind – are seen as an asset by companies that want a wide variety of data for potential future analysis, or for sale to other companies.
The ground rules for personal data, despite general confusion about how they apply in new contexts, have not actually changed. Fair information principles, the basis of most privacy legislation, state that organizations should only collect, store, use, and retain personal information for specific purposes to which individuals have consented. Any information, or combination of information, that is detailed enough to potentially identify a person is personal information and these rules apply.
What does this mean for big data? For many organizations, the main incentive for adopting big data is to have large volumes of information that can be analyzed for multiple purposes, shared with or sold to third parties, and retained indefinitely. Privacy principles prohibit using personal information in this way. However, in most big data contexts there is no reason for anyone to access individual-level records: people need statistics, not individual profiles. De-identification and anonymization methods offer ways of converting personal information holdings into aggregated, non-identifiable data that does not need the protections outlined in privacy principles. For big data to respect privacy it needs to stay “big”: in most circumstances, it should not be possible for anyone handling data to uncover personal information about specific individuals.
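To make the idea of aggregated, non-identifiable data more concrete, here is a minimal Python sketch of one simple approach: drop direct identifiers, generalize detailed values into broad bands, count people per group, and suppress any group that is too small. The field names (`name`, `postal_area`, `age`), the ten-year age bands, and the minimum group size of 5 are all assumptions chosen for illustration, not requirements drawn from any privacy law. Small-group suppression is only one technique among many, and on its own it does not guarantee that data is anonymous.

```python
from collections import Counter


def age_band(age, width=10):
    """Generalize an exact age into a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def aggregate(records, min_count=5):
    """Turn individual-level records into group counts.

    Names are never copied into the output, ages are replaced by bands,
    and groups with fewer than min_count people are suppressed so that
    a count cannot point to one or two individuals.
    min_count=5 is an illustrative assumption, not a legal threshold.
    """
    counts = Counter(
        (r["postal_area"], age_band(r["age"])) for r in records
    )
    return {group: n for group, n in counts.items() if n >= min_count}


# Hypothetical records, invented purely for this example.
sample = (
    [{"name": f"person{i}", "postal_area": "K1A", "age": 30 + i} for i in range(6)]
    + [{"name": "lone visitor", "postal_area": "M5V", "age": 71}]
)

print(aggregate(sample))  # {('K1A', '30-39'): 6}
```

The single 71-year-old in area M5V disappears from the output: a group of one is effectively an individual profile, which is exactly what the statistics are meant to avoid exposing.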