AI and Sensitive Data: a Trust Problem

 
Photo by DJ Johnson on Unsplash.

Granting access to data requires a high level of trust.

Today, anywhere in the world, when a researcher or data scientist wants to train a machine learning algorithm and build a prediction model, they must usually begin by assembling or gaining access to an existing dataset. They explore the data, consult descriptive statistics, manipulate records, and so on. At this point a problem of trust arises: from the moment someone accesses the data, the only protections against illegitimate use are the ethical stance of the data scientist and the law, upheld through contracts or data usage agreements. Ethics and the law, in other words: trust, which is at the heart of collaborative work. But is trust always enough?

The development of AI generates huge opportunities for progress...

It is no secret that we are experiencing an Artificial Intelligence boom. Technologies developed today can automate data analysis with ever-higher predictive performance, at times exceeding human capabilities on specific tasks or augmenting them with decision-support tools. This progress opens up immense opportunities in domains that handle large masses of data (e.g. medical diagnostics, equipment failure forecasting, translation of texts or audio recordings, design generation). In healthcare, 2018 saw the FDA clear or authorize medical devices based on predictive models for the first time (for example Aidoc for acute intracranial hemorrhage, Imagen's OsteoDetect for wrist fractures, and IDx for diabetic retinopathy), marking the crossing of a very significant threshold.

 
Illustration article itnonline.com - Aidoc receives FDA clearance.

...but data and ML competencies do not always meet!

Not all organizations (depending on their place in their industry's value chain, their size, etc.) are able to design algorithms, collect and clean the necessary data, and train the predictive models that would allow them to improve the quality of their services, reduce their costs and production times, or innovate for their users. They need providers, partners, and tools. As such, thousands of specialized companies aim to fill this void by developing AI software for every sector of activity. We are witnessing a real gold rush across all segments of the value chain, from data processing to business applications!

However, collaboration on AI topics with specialized actors is easier in concept than in practice. In areas where the data handled are very sensitive (for example medical records, photographs of people's faces, data representing a company's strategic assets, sovereign data, etc.), a hard question arises: how can AI algorithms be trained without giving access to these very sensitive data? Data holders (hospitals, administrations, companies, etc.) cannot entrust their data to a third party without, voluntarily or not, creating a risk of illegitimate use. In practice, this risk takes the form of leaks, theft, and exposure of data, all of which cause immense harm to the organizations responsible for the data and to the people the data concern.

 
Illustration article healthitsecurity.com - 10 biggest Healthcare Data Breaches of 2019.

How can we facilitate data science collaborations on sensitive data?

As mentioned earlier, in the case of very sensitive data, trust between organizations is understood to be necessary, but it often remains far from sufficient to open access for third-party research or data science projects. As a result, much research, many projects, and plenty of potential discoveries never see the light of day.

The questions we're asking are: how can organizations more systematically engage with sensitive data? How do we make sensitive data science more responsible and trustworthy?

What is at stake: more projects, more discoveries and knowledge, more services built from these data, which constitute an extraordinary raw material. For us, this issue can only be addressed:

  • in a manner that protects the confidentiality of the data: we must find a way to make the data both available and private at the same time. Available at an aggregated scale, to feed analyses and train models; private at the individual scale of granular records, wherever personal or confidential information is concerned (see the first sketch after this list);

  • and by guaranteeing the traceability of the ML operations carried out to develop the model (algorithm transfers, training, model evaluation, etc.). Building this kind of "genealogy" of predictive models makes it possible to consider audit methods, model certification, and measurement of the respective contributions of different datasets to a model's performance (see the second sketch after this list).
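
To make the first point concrete, here is a minimal sketch, in Python, of the "code travels to the data" idea: each data holder runs the computation locally and only aggregated results (here, averaged gradients) ever leave its infrastructure. The DataHolder and train_remotely names are illustrative assumptions, not the API of Substra or any other library, and the sketch deliberately leaves aside questions such as secure aggregation or differential privacy.

```python
import numpy as np

# Hypothetical names for illustration only: DataHolder and train_remotely
# sketch the "code travels to the data" idea; they are not a real API.

class DataHolder:
    """Keeps granular records private; only aggregated results ever leave."""

    def __init__(self, features, labels):
        self._X = features  # raw records, never exposed to the model owner
        self._y = labels

    def compute_gradient(self, weights):
        """Run the model owner's computation locally and return only the
        gradient aggregated over all local records, never the records."""
        errors = self._X @ weights - self._y
        return self._X.T @ errors / len(self._y)


def train_remotely(holders, n_features, lr=0.3, epochs=100):
    """The model owner never sees raw data: it sends weights out, receives
    aggregated gradients back, and averages them across data holders."""
    weights = np.zeros(n_features)
    for _ in range(epochs):
        grads = [holder.compute_gradient(weights) for holder in holders]
        weights -= lr * np.mean(grads, axis=0)
    return weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -1.0])
    holders = []
    for _ in range(3):  # e.g. three hospitals, each keeping its own records
        X = rng.normal(size=(100, 2))
        y = X @ true_w + rng.normal(scale=0.1, size=100)
        holders.append(DataHolder(X, y))
    print(train_remotely(holders, n_features=2))  # approx. [2.0, -1.0]
```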
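
For the second point, here is an equally simplified sketch of a tamper-evident "genealogy" of ML operations: each log entry commits to the previous one through a SHA-256 fingerprint, so rewriting any past step invalidates every later fingerprint. The record structure and field names are assumptions made for this example; they are not the actual ledger format used by the Substra framework.

```python
import hashlib
import json

def fingerprint(payload: dict) -> str:
    """Deterministic SHA-256 fingerprint of a JSON-serializable record."""
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()


class OperationLog:
    """Append-only log in which each entry commits to the previous one,
    so tampering with history changes every subsequent fingerprint."""

    def __init__(self):
        self.entries = []

    def record(self, operation: str, **details) -> str:
        previous = self.entries[-1]["fingerprint"] if self.entries else None
        entry = {"operation": operation, "details": details, "previous": previous}
        entry["fingerprint"] = fingerprint(entry)
        self.entries.append(entry)
        return entry["fingerprint"]

    def verify(self) -> bool:
        previous = None
        for entry in self.entries:
            body = {k: entry[k] for k in ("operation", "details", "previous")}
            if entry["previous"] != previous or entry["fingerprint"] != fingerprint(body):
                return False
            previous = entry["fingerprint"]
        return True


if __name__ == "__main__":
    # Field values below are placeholders, not real assets or results.
    log = OperationLog()
    log.record("algorithm_registered", author="partner_A")
    log.record("training_task", dataset="hospital_B_registry")
    log.record("evaluation", metric="auc_placeholder")
    print(log.verify())  # True; editing any past entry makes this False
```

Such a log is only a starting point: sharing and anchoring these fingerprints among the partners is what turns the "genealogy" into something each of them can audit independently.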

Substra Foundation exists to explore these questions.

The raison d'être of Substra Foundation is to attempt to answer these questions. We are currently involved in research projects with the HealthChain and MELLODDY consortiums, driving the Substra open source initiative, and taking part in the work of the emerging "trustworthy data science" community. In future posts we will continue to pursue the ideas only touched on in this article. Until then, do not hesitate to send us your comments, questions, or ideas; we're eager to hear from you!