Comment: The search for real-world datasets - it’s a lot harder than it seems

AI needs access to a vast number of real-world datasets if it is to reach its full potential, but in reality finding them can be a lot harder than it seems, says Chris Longstaff, VP product management at Mindtech.


There is a real issue in finding real-world data sources that are suitable for commercial use, especially for machine learning (ML) training. Even when such datasets are on offer, you must factor in their biases and limited scope. AI specialist NVIDIA pulled together a dataset of 70,000 images of human faces from Flickr. Notably, the company states two things in particular. Firstly, and crucially, that ‘only images under permissive licenses were collected’ and that the compiled dataset is ‘made available under the Creative Commons BY-NC-SA 4.0 license’. Secondly, it states that this dataset can be used for non-commercial purposes only.

The general assumption has always been that if you are part of a commercial organisation, then anything labelled ‘NC’ (non-commercial) is out of bounds. But NVIDIA was using images labelled as NC. So does that mean there is a difference between using these images purely for research purposes and using them for product deployment? And if an image has a permissive licence on a website like Pixabay, can we assume that it can then be used for ML training?

It feels like there is a grey area to explore here. Key points may rest on whether the data is used for research, training or testing, and on whether we should indeed distinguish between them.

Why is there a lack of clarity?

For general commercial use, the rules are clear. NC images are non-commercial and therefore cannot be used for commercial purposes. But when it comes to ML training, the supposed commercial ‘rules’ are less clear. My own working assumption has been that anything marked NC is off limits, as the work it feeds into ultimately contributes to the company’s profitability.

Is having an ‘off limits’ policy for NC images in research actually, and unnecessarily, impeding the progress of research teams? In the current landscape, we exercise caution even when using images that are licensed for commercial use. However, if NC images are used only for training and are never seen outside the company or used for direct commercial gain, does that make it acceptable to use them for this purpose?

Should licence use for data differ for training purposes vs testing?

NVIDIA is a global player in the field. So, can we assume that just because an image has a permissive licence, we can use it for ML training? Or is this a case of ‘if they can do it then why can’t we’?

From a training standpoint, it makes sense that using these images could open up a whole new range of real-world data to help accelerate innovation in the sector. Yet it is understandable that wariness over using these datasets remains. Have the identifiable people pictured on, for example, some of the ‘open’ sites that allow commercial use signed model releases? And not only model releases for general use, but model releases that include use for AI purposes?

For the latter, it seems extremely unlikely. The question comes under even more scrutiny when it involves images of minors. Theoretically, the parents should have given their permission. But could these minors later withdraw consent once they have reached the age of majority?

When the problem is set out like this, it feels like there needs to be a system that both holds evidence of releases and can grant permissions for AI work in particular. Under such a system, the wording would permit the use of images to help train networks, but never their further publication.

Can synthetic data bypass ‘the licence issue’ altogether?

There is a whole range of licence issues that make the use of real-world data that much harder. Licences are there for a reason, and consent should and must be given for using images. As a result, can we really gather enough data from real-world datasets and images alone? These legal constraints lend even further weight to the case for synthetic data.

Synthetic data is artificially generated by computer systems to match real-world data. Its rise stems from a desire to reduce reliance on real-world datasets, and it offers a dramatically more cost-efficient and quicker means of producing large volumes of data that do not carry the tag of the ‘licence issue’. This means ML training can advance more quickly, with larger datasets. The more data that is created artificially, the more closely it will be able to match real-world use cases.

Synthetic data represents a tangible way to fill the real-world data gap. For now, it cannot solve the issue alone: a combination with real-world data is required. But its use and increasing prominence create a whole new world of possibilities for ML training that can bypass the tricky search for real-world datasets.

Chris Longstaff, VP product management at Mindtech