Comment: Three questions for Data Privacy Week

With the soaring use of generative AI and large language models over the past year – and regulators scrambling to keep pace – it feels as if Data Privacy Week is taking on an extra sense of importance for 2024, says Ben Wells, head of statistical data science, Bayezian.

The New York Times is currently suing Microsoft and OpenAI, claiming it breached copyright training ChatGPT and similar tools - AdobeStock

There are concerns about the data used to train AI models. The New York Times is currently suing Microsoft and OpenAI, claiming they breached copyright in training ChatGPT and similar tools. OpenAI has said it would be impossible to train models without copyrighted material.

There are also concerns about the data collected by models when they are used by individuals or businesses. Anything you input into an AI tool can be used by the model to ‘learn’ and ‘improve’. As such, the general advice is to tread carefully and not to enter any personal information, since that data may be visible and open to leaks.

Considering these industry issues, we set out three questions for Data Privacy Week:

Is there an answer to the AI training issue?

A growing number of ongoing copyright lawsuits directed at OpenAI, not only from The New York Times but also from a whole host of top authors, emphasises the need for a sensible resolution.

With OpenAI admitting its models must be trained on copyrighted material, the question is less about finding damning evidence than about how those who own the works are reimbursed. OpenAI has raked in a massive $1.3bn in revenue in just one year. Despite the huge sum, none of the money has made its way to the creators of those works.

One idea that’s been floated, supposedly by the developers themselves, is for copyright laws to be curtailed for these particular chatbots. But creators would only continue to feel cheated, and a move such as this would pave the way for copyright infringements of other kinds in the future.

It seems the only solution is to make LLM developers pay for the data they use. But how much? Due to the nature of these bots, it will not be possible to determine which datasets have been used for which query. This could mean that a flat fee will have to be paid for the training data. Whether this will be accepted by the creators remains to be seen.

Doesn’t social media already use AI to know us better than ourselves anyway?

Chatbots and the like can take our inputted data and use it to improve the service they offer to others. But how personal data is intended to be used should be clearly set out, not hidden within a small terms and conditions link that no one will read. Unfortunately, this is often the case.

It’s also critical that data is anonymised before it can be seen by a human or the training algorithm, stored securely, and deleted after full use or after the time agreed in the terms and conditions. The developer must also ensure that collected data cannot be accessed by another user of the LLM.

One issue that can arise in the case of LLMs is that it’s often difficult to explain or quantify what the data is being used for, so greater transparency around this may be required.

Of course, users should also have the option for their data not to be used; this is the case in GPT-3.5 and GPT-4. However, the default setting is that any inputted data can be used for training and, once entered, it cannot be taken back from the model.

How should companies and regulators act to ethically manage AI’s growth?

As the use of LLMs becomes more widespread, companies must be held accountable for the models they create, meaning that they must implement the necessary safeguards to stop their models being used for nefarious actions.

The safe management of the use of AI is as important as turbocharging innovation around it. Unfortunately, the burden of behaving in an ethical manner is likely to fall on private companies: governments are too sluggish, and any regulation would require global consensus to stop companies simply relocating.

Currently, there are only a handful of companies making GenAI tools and LLMs powerful enough to be useful to businesses and consumers; these companies will hopefully put in sufficient guardrails and work together to stop their tools being used by bad actors to influence global or private affairs.

The EU AI Act is a good start at introducing some general ground rules. These do not appear to be so stringent as to stifle growth, but they might be tricky to police if people move offshore. We can only hope that other nations will follow suit, as there currently seems to be a good appetite for widespread regulation.


Ben Wells, head of statistical data science, Bayezian