Access control is a keystone in conventional approaches to protecting personal information. Lately, however, there is a great deal of uncertainty about whether big data repositories such as “data lakes” do in fact contain personal information, and whether access control is possible in these unstructured environments. Does access control still have any relevance in the world of big data?
Do data lakes contain personal information?
Privacy laws are written to protect “personal information” held by organizations – that is, any information that could potentially identify specific individuals. Yet as larger and larger volumes of data are collected and aggregated through “big data” initiatives, the definition of personal information is getting fuzzy. Many corporations are creating “data lakes”: massive repositories of relatively unstructured data collected from one or several sources. These often contain a mix of metadata (user activity data, such as web search terms, IP addresses, or GPS data) and personal content (such as personal messages and social media posts) which, in combination, can often identify individuals. For example, publicly available, searchable databases of Twitter activity map tweets by location – locations so specific that they can show which house a supposedly anonymous person was in when they posted a message. In the commercial realm, some retailers and entertainment venues have begun tracking customers’ movements through their buildings using smartphone MAC addresses, which can be considered personal information.
As these examples show, data lakes can contain extremely sensitive personal information. This data is not intended to be viewed by anyone – it is usually processed by computer algorithms – but too often, any employee involved in data analysis can access personal information about specific individuals. Privacy principles have traditionally emphasized access control: providing users with access only to the information that they need to do their work. But does access control have any meaning in unstructured big data environments?
Eroding informed consent
For social media companies, “privacy” usually means privacy from other users, not from corporations.
For example, the vast majority of Facebook users are unaware that advertisers can buy all posts or personal messages containing certain keywords: a cosmetics company, for instance, could buy posts and messages that include the words “beautiful,” “skin,” or “makeup.” The company can then display an ad to both the senders and recipients of these messages. When anyone clicks on the ad, the company receives the full contents of their profile. In the end, a “private message” is stored in data banks or data lakes belonging to Facebook and to the advertiser, to be used for various consumer research and marketing purposes.
Is access control compatible with data lakes?
Data lakes are not set up to protect identity. They usually contain a mix of different types of personal data that in combination can often identify individuals. Privacy regulations demand that personal information be protected by access controls, yet data lakes are not structured for access control: usually, anyone with access has access to all of the data. Though the data may be intended for aggregate-level analysis rather than for viewing individual-level records, an employee intent on selling data to identity thieves or stalking an ex-spouse could do a great deal of damage.
Companies have taken different approaches to protecting privacy in the context of big data initiatives. Some will create several separate data lakes, often based on business function (operations, marketing, security and risk, etc.). This approach can help to ensure that data use aligns with the purposes for which data was collected. Companies also vary in their decisions on who will have access to data. Will it be a tier model? Will there be a super administrator with access to everything? How will departments share the data? How can access be separated?
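To make the idea of separate, purpose-aligned data lakes concrete, here is a minimal Python sketch of a purpose check. Everything in it is an assumption made for illustration: the lake names, the purpose labels, and the AccessRequest and is_allowed names are hypothetical and do not come from any particular product or company.

```python
from dataclasses import dataclass

# Hypothetical mapping: each data lake lists the purposes
# for which its data was originally collected.
LAKE_PURPOSES = {
    "operations": {"service_delivery", "capacity_planning"},
    "marketing": {"campaign_analysis"},
    "security_risk": {"fraud_detection", "incident_response"},
}


@dataclass(frozen=True)
class AccessRequest:
    user: str
    lake: str
    purpose: str


def is_allowed(request: AccessRequest) -> bool:
    """Allow a request only if its stated purpose matches the lake's purposes."""
    # Unknown lakes have no registered purposes, so every request is denied.
    return request.purpose in LAKE_PURPOSES.get(request.lake, set())
```

A check like this answers only the question of purpose alignment. The questions of tiers, super administrators, and sharing between departments would need further rules on top of it.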
Shifting from access control to use control
As we’ve stated before, the ground rules for privacy have not changed with the introduction of big data. Personal information needs to be protected by privacy safeguards including access control. We know that unstructured and semi-structured big data models do not lend themselves to access control. Fortunately, most of the purposes for which big data is used do not require access to individual-level personal information. Rather than access control, the concept of “use control” may be more appropriate to big data contexts: focusing not so much on which data can be accessed as on the form in which it can be accessed. Rather than locking up personal information, it is possible to blur it – to make sure that personal data is not viewable as information about an individual, but only as a pixel in a much larger picture.
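One way to picture “use control” is a query layer that hands analysts only aggregates and suppresses any group too small to hide an individual. The Python sketch below is illustrative only: the aggregate_counts function, the MIN_GROUP_SIZE threshold of 5, and the sample visit records are all assumptions, and small-group suppression is just one of several possible blurring techniques.

```python
from collections import Counter

# Hypothetical threshold: groups smaller than this are withheld,
# because a very small group can point to a specific person.
MIN_GROUP_SIZE = 5


def aggregate_counts(records, group_key, min_group_size=MIN_GROUP_SIZE):
    """Return per-group record counts, omitting groups below the threshold."""
    counts = Counter(record[group_key] for record in records)
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Made-up sample data: one record per visit to a zone of a store.
records = (
    [{"user_id": i, "zone": "entrance"} for i in range(12)]
    + [{"user_id": 100 + i, "zone": "electronics"} for i in range(7)]
    + [{"user_id": 200, "zone": "stockroom"}]
)

visits_by_zone = aggregate_counts(records, "zone")
print(visits_by_zone)  # {'entrance': 12, 'electronics': 7}
```

The analyst sees how busy each zone is, but the single stockroom visit, which could single out one person, never leaves the data lake.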