Q&A
How was the assessment developed?
It is the result of a participatory effort initiated in mid-2019 and led by the association Labelia Labs. This process is described in this blog post, which we recommend reading.
Let us nonetheless summarize some contextual elements from the article. First, our observation is that a growing tension has emerged between, on the one hand, the potential and usefulness of AI techniques, and on the other, the difficulty of trusting these techniques or their implementations (whether by private actors such as Apple with the Apple Card or Tesla in this surprising example, or by public actors such as governments, see COMPAS regarding parole decisions in the US, the recurring debates around Parcoursup in France, unemployment benefits in the Netherlands, and many others). In this context, it becomes increasingly challenging for an organization to deploy data-science-based approaches in its products and services and to publicly stand behind them.
Obviously, this tension is not new. Some risks are very real, and there seems to be a broad consensus on the need to establish reassuring and structured frameworks. A quick search for “AI and ethics” or “responsible AI” is enough to show the proliferation of initiatives in this space. As a result, there is a wealth of material available. However, much of it consists of lists of high-level principles and does not provide concrete, operational guidance. How should an organization position itself? How can it assess its own practices? What should it work on to comply with these principles?
It is from this reflection that we set out to develop a tool intended for practitioners—useful and actionable as early as possible. Give it a try and tell us what you think.
Who is this assessment for?
The self-assessment tool was designed to suit (and, we hope, provide value to) all organizations engaging in data-science, AI, or ML activities: companies, university labs, start-ups, specialized consultants, and so on. A data scientist, a team lead, or a CTO, for example, can complete the assessment. The tool also allows multiple users to work on the same organization’s assessment, making it possible to divide the topics among team members.
How is the assessment structured?
It consists of six thematic sections. We chose not to follow the seven themes of the EU high-level expert group report or its ALTAI tool, opting instead for a structure that we hope is more pragmatic and closer to the lifecycle of a data-science project. Time will tell whether this approach proves effective.
Is the assessment fixed, or will it evolve?
It will continue to evolve. From the outset, it was clear that this would be an iterative effort: it was never an option to work for a while, publish the result, and move on. The field evolves quickly, and perspectives differ widely (large corporations, public-sector organizations, small start-ups, specialized consultants, regulators…). It was therefore necessary to start somewhere and improve over time. Now that the platform is online, the goal is not to introduce updates every week, as this would constantly render ongoing or recently completed assessments obsolete. We therefore aim for an update cycle of roughly every three to six months.
To support these updates and make them beneficial for users and organizations that have already completed an assessment, the platform includes a migration feature. It allows a given assessment to be migrated to a newer version of the evaluation framework: all answers to unchanged evaluation elements are preserved.
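For illustration only, the carry-over logic can be sketched as follows. The idea of keying answers by a stable element identifier, and the function below, are assumptions made for the example, not a description of the platform's internals.

```python
def migrate_answers(old_answers: dict[str, list[str]],
                    new_framework_elements: set[str],
                    changed_elements: set[str]) -> dict[str, list[str]]:
    """Carry answers over to a newer framework version (hypothetical sketch).

    old_answers: answers keyed by a stable element identifier.
    new_framework_elements: identifiers present in the new framework version.
    changed_elements: identifiers whose wording or responses changed
                      and therefore must be answered again.
    """
    return {
        element_id: answers
        for element_id, answers in old_answers.items()
        if element_id in new_framework_elements and element_id not in changed_elements
    }
```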
Score
The composite score has a theoretical maximum of 100 points for the entire assessment. It provides an indication of an organization's maturity level regarding a responsible and trustworthy approach to data science. As of late 2020, a score of 50/100 or more can be considered to indicate a very advanced level of maturity.
The scoring mechanism is relatively simple:
With each version of the assessment, we define a number of points for each response item of every evaluation element, as well as an importance weighting calibrated to ensure that the theoretical maximum total is exactly 100.
For single-choice items, the point value of the selected response is retained; for multiple-choice items, the point values of all selected responses are summed.
The total score obtained is the sum of the point values for each element, weighted by the importance weighting.
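To make the mechanism concrete, here is a minimal sketch of this weighted-sum logic in Python. The data model (an element carrying response points, an importance weight, and the selected responses) is a simplified assumption for illustration, not the platform's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class EvaluationElement:
    """One evaluation element of the assessment (hypothetical, simplified model)."""
    multiple_choice: bool                # True if several responses can be selected
    response_points: dict[str, float]    # points attached to each possible response
    weight: float                        # importance weighting of the element
    selected: list[str] = field(default_factory=list)  # responses chosen by the user

    def points(self) -> float:
        """Points earned on this element, before weighting."""
        if self.multiple_choice:
            # Multiple-choice: the points of all selected responses are summed.
            return sum(self.response_points[r] for r in self.selected)
        # Single-choice: only the selected response's point value is retained.
        return self.response_points[self.selected[0]] if self.selected else 0.0

def total_score(elements: list[EvaluationElement]) -> float:
    """Weighted sum of element points; weights are calibrated so the maximum is 100."""
    return sum(e.points() * e.weight for e in elements)
```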
There is, however, a subtlety in cases where an organization is not concerned by certain evaluation elements and by the associated risk domains. It would be illogical to deprive an organization that is not exposed to a given risk of points that other organizations—actually exposed to that risk—may obtain. At the same time, it would also be illogical to award all possible points automatically, as this would result in artificially high scores even when little is actually being done. The mechanism to address this issue is as follows:
When an organization is not concerned by an evaluation element, it is automatically granted half of the maximum number of points for that element. The other half contributes to a temporary variable representing the number of points that cannot be obtained.
Once all non-applicable elements have been processed, an intermediate score is computed by summing the points for each element. This intermediate score is therefore not out of 100, but out of an intermediate maximum = (100 – number of points that cannot be obtained).
This intermediate score is then rescaled to the 0–100 scale by multiplying it by a factor of (100 / intermediate maximum).
This mechanism is a compromise intended to ensure:
(i) that the fact of not being exposed to certain risks is taken into account;
(ii) that every assessment score is always reported on a 0–100 scale.
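The sketch below, reusing the hypothetical EvaluationElement from the previous example, illustrates how this compromise could be computed: non-applicable elements are granted half of their weighted maximum, the other half is removed from the attainable maximum, and the result is rescaled to 0–100. Again, this is an illustrative assumption, not the platform's code.

```python
def max_element_points(e: EvaluationElement) -> float:
    """Maximum unweighted points attainable on an element (sketch assumption)."""
    if e.multiple_choice:
        return sum(e.response_points.values())
    return max(e.response_points.values())

def score_with_non_applicable(elements: list[EvaluationElement],
                              not_concerned: set[int]) -> float:
    """Compute a 0-100 score when some elements (given by index) do not apply."""
    earned = 0.0
    unobtainable = 0.0  # points that cannot be obtained, removed from the maximum

    for i, e in enumerate(elements):
        max_points = max_element_points(e) * e.weight
        if i in not_concerned:
            earned += max_points / 2        # half of the maximum granted automatically
            unobtainable += max_points / 2  # the other half withdrawn from the maximum
        else:
            earned += e.points() * e.weight

    intermediate_max = 100.0 - unobtainable
    # Rescale the intermediate score back to a 0-100 scale.
    return earned * (100.0 / intermediate_max)
```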
Additional clarifications in response to frequently asked questions:
Why are point values for each response item not shown during the assessment?
After studying several evaluation systems for professional practices across different sectors, we concluded that it is best not to display point values during the assessment. Our goal is to prioritize the substantive content and to limit the risk of distracting or influencing the user by revealing numerical information that could lead them to optimize for points rather than provide an accurate representation of their practices.