Babylonian Pearl

Vulnerability of De-identified Data and AI

Note: this is an incomplete re-write of an oral essay from March 2023

7 March 2023

Artificial Intelligence can currently assemble data from multiple sources and re-identify a person from de-identified materials. Does this render current legal protections of privacy obsolete?

No, not entirely. There are limited but available options, and there are key changes happening that may address some of the significant gaps I will discuss. It is true that controlling the re-identification of public datasets is becoming nearly impossible as AI develops - and the Privacy Act 1988 and other related legislation have some significant limitations.

Re-identification

Re-identification is a useful method with a broad range of applications, from crowd traffic management to healthcare research. On the other hand, re-identification of individuals can have serious consequences where, for example, private health or financial information is recovered - leading to economic loss, fraud, discrimination, embarrassment or identity theft.

There are particular problems with AI and re-identification. Firstly, it doesn't take much to apply a de-anonymization methodology to public data. Two computer scientists at the University of Texas did exactly that to a dataset of anonymous movie ratings from 500,000 Netflix subscribers - demonstrating that an adversary who knows only a little about an individual subscriber can easily identify that subscriber's record in the dataset, revealing users' identities and uncovering their apparent political preferences and other potentially sensitive information.[1]
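To make the mechanics concrete, here is a minimal Python sketch of the linkage idea behind that study - not the authors' actual method, and the records and auxiliary knowledge below are invented for illustration. An adversary who knows a few of the target's ratings simply scores every anonymised record by how many of those ratings it matches:

```python
# Toy anonymised ratings dataset - invented for illustration.
anonymised_records = {
    "user_0041": {"The Matrix": 5, "Amelie": 2, "Heat": 3, "Alien": 5},
    "user_0042": {"The Matrix": 3, "Titanic": 5, "Heat": 2},
    "user_0043": {"Amelie": 4, "Alien": 5, "Heat": 3, "Clue": 1},
}

# What the adversary already knows about the target (say, from a
# public review profile) - only a handful of data points are needed.
auxiliary_knowledge = {"Amelie": 4, "Alien": 5, "Heat": 3}

def match_score(record: dict, aux: dict) -> int:
    """Count how many of the adversary's known ratings a record matches."""
    return sum(1 for movie, rating in aux.items() if record.get(movie) == rating)

best = max(anonymised_records,
           key=lambda uid: match_score(anonymised_records[uid], auxiliary_knowledge))
print(best)  # user_0043 - the "anonymous" record is singled out
```

The real attack tolerates noisy matches and approximate dates, but the principle - sparse data plus a little auxiliary knowledge - is the same.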

Further, an artificially intelligent machine will draw connections wherever they may appear. Re-identification can be an incidental consequence of the AI performing its task, rather than the result of a human making a conscious decision to re-identify.

A third issue with AI is that, in addition to structured data like that found in spreadsheets, it can re-identify individuals from the far more common unstructured data, such as text documents or images - and the legal requirements that must be fulfilled to properly anonymise such data formats remain unclear.

What is on the table

The Privacy Act covers re-identified information - because any information that is re-identified from an apparently de-identified dataset becomes personal information, and must again be handled by subject entities in accordance with the Australian Privacy Principles (APPs).

Further, entities can be in breach of the APPs when they fail to adequately de-identify information before release - as seen with the Department of Health breach discussed in class. Entities are also required to allow an individual to deal with them anonymously or under a pseudonym where practicable - but, as you can imagine, that is only practicable in specific circumstances.

And I’ll just note, for malicious re-identification or data extraction acts, there are some cybercrime offences that can be relevant.[2]

Here is an example of our framework in action in the face of re-identification. The Privacy Commissioner conducted a joint investigation with its UK counterpart into Clearview AI, an app promoted to law enforcement agencies which scraped data, particularly images of individuals, from the internet for biometric profiling - kind of like a reverse image search for human faces, used to re-identify individuals. In her 2021 decision, the Commissioner found breaches of several APPs, including those concerning quality, consent, and lawful and fair collection. Most interestingly, in finding this overseas company breached the Privacy Act, the Commissioner raised the issue that Clearview was monetising individuals' data for a purpose entirely outside the reasonable expectations of those individuals.[3]

Problems with what is on the table

1 - Definitions

What is de-identification? There is a spectrum of de-identification techniques, producing information that ranges from easily re-identifiable to irreversibly anonymised. Lower levels of de-identification - such as using a pseudonym in place of a name - may still include abstract identifiers which, when combined with other information, can lead to the information becoming directly identifiable.[4]
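A toy Python sketch of the lower end of that spectrum (the records are invented for illustration): replacing names with pseudonyms leaves quasi-identifiers such as postcode, birth year and sex in place, and their combination can still single a person out:

```python
from collections import Counter

# Pseudonymised records - names are gone, quasi-identifiers remain.
records = [
    {"pseudonym": "P1", "postcode": "2000", "birth_year": 1985, "sex": "F"},
    {"pseudonym": "P2", "postcode": "2000", "birth_year": 1985, "sex": "M"},
    {"pseudonym": "P3", "postcode": "2913", "birth_year": 1972, "sex": "F"},
]

def uniquely_identifiable(records, keys):
    """Return records whose combination of quasi-identifier values is unique."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if combos[tuple(r[k] for k in keys)] == 1]

# Every record here is unique on these three attributes, so anyone who
# knows a person's postcode, birth year and sex can recover their row.
print(uniquely_identifiable(records, ["postcode", "birth_year", "sex"]))
```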

Two of the more developed legal models have taken different approaches to this problem. The European Union's GDPR sets a stringent standard for information to qualify as anonymous, such that its data protections no longer apply: not only directly identifying information but also indirectly identifying information must be removed. This standard is proportionate to the GDPR's broad definition of personal data - any information *relating* to an identified or identifiable natural person.

For subject entities, the California Consumer Privacy Act requires *only* a reasonably-unidentifiable threshold, but requires that business processes and technical safeguards be implemented by the controlling entity.[5]

In the Australian context, de-identified data is by definition not personal information and is thus not protected by the Privacy Act. The Australian definition of personal information is narrower than the GDPR's.

In 2017, the Full Federal Court determined in Privacy Commissioner v Telstra Corp Ltd[6] that, in order for information to be about an individual, the individual must *be the subject matter of the information*. A mere link between the individual and the information would not be enough to bring the information within the scope of the Privacy Act.

In the Telstra case, the court found that metadata generated through the usage of a mobile phone, which identified the user's connections with cellular towers and visited URL addresses, was not personal information.

This means that such data can, in theory, be made public without regard to the Privacy Act - and may be easy for an AI system to re-identify with fairly simple data-matching techniques.
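As a hypothetical sketch of how simple that matching can be (all the data below is invented), consider joining pseudonymous tower-connection metadata against a few observed facts about a target:

```python
from collections import Counter

# Pseudonymous telco-style metadata: (pseudonym, tower_id, hour_of_day).
anonymous_metadata = [
    ("sub_a", "TWR_12", 8), ("sub_a", "TWR_47", 13), ("sub_a", "TWR_12", 19),
    ("sub_b", "TWR_03", 8), ("sub_b", "TWR_47", 13), ("sub_b", "TWR_03", 19),
]

# Known facts about the target: home near TWR_12, office near TWR_47.
target_observations = {("TWR_12", 8), ("TWR_47", 13), ("TWR_12", 19)}

hits = Counter()
for pseudonym, tower, hour in anonymous_metadata:
    if (tower, hour) in target_observations:
        hits[pseudonym] += 1

print(hits.most_common(1))  # [('sub_a', 3)] - the pseudonym is unmasked
```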

The approach to de-identification in the Privacy Commissioner's guidance is risk-based: whether information is personal or de-identified will depend on the context. Information will be de-identified where the risk of an individual being re-identified in the data is very low in the relevant release context (or "data access environment").

The ‘context of release’ refers to matters such as who will hold and have access to the information, other information available to those recipients, and the practicality of identifying an individual using that information.

However, what is practicable for artificial intelligence is far beyond what is practicable for a human. As the Privacy Commissioner's guidance acknowledges, publicly released data can be accessed by anyone in the world (including experts and people with specific knowledge or a high level of skill), so entities must consider the risks in that context when publishing to the internet.

The Department of Health MBS data issue of 2016 was publicly discussed. In their investigation, the Commissioner noted that they were uncertain whether de-identification of a unit-level dataset of this size and detail is possible to an extent that would permit full public release while still maintaining the utility of the data.[7]

While this flexible, risk-based approach can in theory account for the growing capabilities of AI, whether entities have the specialist skills required to properly assess the risks and de-identify information to the level required for publication on the internet is another question.

The fact that there are different standards of de-identification across jurisdictions makes this global issue difficult to resolve with legislation.

2 - Jurisdiction

The second problem I will cover with the current legal protections is that, even though the Privacy Act does apply to re-identified information, the entity or individual responsible for the AI's actions may not be bound by the Act.

The Privacy Act's jurisdictional problems are these: it does not cover entities outside Australia that fall short of the Act's extra-territorial application, and a malicious act by an individual who is not an "organisation" as defined would not be subject to the Act. As to foreign entities, section 5B of the Privacy Act requires an "Australian link" for proceedings to be brought against a foreign entity under the Act.

If data is re-compiled to reasonably identify an individual, and is held by an entity under the jurisdiction of the Privacy Act, it can be personal information and breaches are actionable.

Recent amendments[8] passed in December 2022 relaxed this "Australian link" requirement. Previously, an entity had to collect or hold personal information in Australia for the Act to apply to it; now the entity merely has to carry on business in Australia. This aligns the Privacy Act with the Australian Consumer Law.

As mentioned in the seminar earlier, a procedural matter between the Information Commissioner and Facebook is before the High Court on this very question of an Australian link and what constitutes carrying on business in Australia.[9]

Therefore, subject to the High Court's views, there is some hope that larger corporations with the capacity to use AI to re-identify or profile individuals might, despite their fragmented corporate structures, be captured by the extra-territorial application of the Privacy Act.

On to individuals. Generally, the Privacy Act doesn't apply to an individual acting in their personal capacity. Using AI, all you need is a laptop, an internet connection and public datasets, and you can start digging for personal information - and the individual doing so is not necessarily bound by the Privacy Act.

However, if an individual with an Australian link carries out that act for a business - to sell the data, or to target individuals with marketing - the individual may be acting in their capacity as an organisation, and certainly so if they were making a neat turnover of $3 million plus.[10]

As such, the Privacy Act's jurisdiction is limited to some extent by geography, and it is limited by its application to certain organisations and government agencies - not individuals.

AI doesn't need an official business to run data-matching and profiling to re-identify people. Nevertheless, the recent expansion of the Privacy Act’s application is a positive step.

On the horizon

So, finally, what's next? The Privacy Act has recently been reviewed by the Attorney-General's Department. In their report, they recommend:

Expanding the definition of personal information to something similar to the GDPR's - so that information relating to an individual can be personal information, rather than only information about an individual.

Introducing obligations relating to the security of de-identified information, and a prohibition on re-identification except in specific cases.

Consulting on introducing a criminal offence for malicious re-identification of de-identified information where there is an intention to harm another or obtain an illegitimate benefit, with appropriate exceptions.

Introducing a statutory tort for invasion of privacy, which could be used in circumstances of malicious re-identification by an individual who is otherwise not subject to the Privacy Act. A tort, absent a common law action as mentioned by Arthur, may be an adequate alternative to introducing criminal offences for re-identification.

In order to harness AI for good, access to big data sets is required, particularly for diagnostic and predictive tools in health.

However, for governments and large institutions, this can be done in a safer way.

The government has recently introduced the Data Availability and Transparency Act 2022, a scheme with an embedded privacy framework enabling Commonwealth bodies to share available datasets with universities, state governments and the like, for research, development and policy purposes.

One of the privacy protections in this new Act requires that, when de-identified data is shared, the required "data sharing agreement" must prohibit the recipient from taking any action that may result in the data ceasing to be de-identified. It has some similarities with the US HIPAA safe harbour provision, but with broader application than health data.[11]

This is a good first step, as beneficial AI access to de-identified datasets needs to be possible in a controlled environment.

Outside of law, there is traction in computer science for de-identifying techniques that can combat AI's re-identification powers - including differential privacy procedures and federated learning techniques, as explored in the Tschider reading. For example: adding noise to datasets to ensure that certain information cannot be recovered, and keeping the data on your phone while still contributing to research.
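As an illustration of the first idea, here is a minimal differential-privacy sketch in Python - the dataset and the epsilon value are invented for illustration. An aggregate query is answered with Laplace noise calibrated so that no single record can be confidently inferred from the answer:

```python
import numpy as np

# Toy "sensitive" dataset - invented for illustration.
salaries = [52_000, 61_000, 58_000, 75_000, 49_000]

def dp_count_over(data, threshold, epsilon=0.5):
    """Noisy count of records above a threshold (epsilon-differential privacy).

    A counting query has sensitivity 1 - adding or removing one person
    changes the true count by at most 1 - so Laplace noise with scale
    1/epsilon is enough to mask any individual's contribution.
    """
    true_count = sum(1 for x in data if x > threshold)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

print(dp_count_over(salaries, 60_000))  # hovers around the true count of 2
```

Federated learning complements this by never centralising the raw data at all: each device trains on its own records and shares only model updates.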

This progress outside of legislation is just as important as regulatory reform, in light of the social, research and economic benefits of de-identified public datasets.

Finally, while we are compelled - economically and through social norms - to divulge our information online, the Information Commissioner's consideration of the reasonable expectations of individuals is a good step towards a stronger interpretation of the privacy protections under our legal framework.


[1] Arvind Narayanan and Vitaly Shmatikov, "Robust De-anonymization of Large Sparse Datasets" (IEEE Symposium on Security and Privacy, 2008)

[2] Criminal Code Act 1995 (Cth), Parts 10.7 and 10.8

[3] https://www.oaic.gov.au/newsroom/clearview-ai-breached-australians-privacy

[4] Rothstein

[5] GDPR, Recital 26; CCPA

[6] Privacy Commissioner v Telstra Corporation Limited [2017] FCAFC 4 (19 January 2017) http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCAFC/2017/4.html

[7] See the OAIC decision here https://www.oaic.gov.au/privacy/privacy-assessments-and-decisions/privacy-decisions/investigation-reports/mbspbs-data-publication

[8] Privacy Legislation Amendment (Enforcement and Other Measures) Act 2022

[9] [Whether prima facie case appellant "carr[ied] on business in Australia" within meaning of 5B(3)(b) of Privacy Act – Whether prima facie case appellant "collected… personal information in Australia" within meaning of s 5B(3)(c) of Privacy Act.]

[10] Privacy Act 1988 (Cth), s 6C

[11] HIPAA safe harbor: once the specified identifiers have been removed, the covered entity must have no actual knowledge that the remaining information could be used to identify the patient. If this "no actual knowledge" requirement is satisfied, the PHI has been successfully de-identified under the safe harbor method.