Contributing Editor David J. Hand (Imperial College London) has been thinking about the ethical, social and policy challenges associated with the rise and rise of “big data”:

Data ethics seem to be the flavour of the month. In the UK alone, the establishment of the National Statistician’s Data Ethics Advisory Committee has been quickly followed by the Government Department for Digital, Culture, Media and Sport launching its Data Ethics and Innovation Centre, and the Nuffield Foundation launching its Ada Lovelace Centre, aimed at taking “a lead on the interaction between data, ethics, and artificial intelligence in the UK”. And there’s nothing unique about the UK in this: a quick Google search shows a proliferation of such bodies with, for example, the Council for Big Data, Ethics, and Society being established in the US in 2014, aimed at providing “critical social and cultural perspectives on big data initiatives”. Indeed, it is not even limited to governments: corporations and other bodies are also concerned that their use of data, which is often central to their business model, should be ethically sound, not least to avoid the risk of public backlash and possibly highly restrictive legislation.

Of course, statisticians have long been aware of the ethical dimensions of their work, though usually these were manifest through particular application domains, such as a requirement to include statisticians on medical ethics committees, or the requirement to be able to explain an adverse decision in the context of consumer loans. Professional bodies of statisticians, such as the ASA and RSS, have long had systems of ethical guidelines, as have other organisations for which data are central (e.g. the ACM).

But more recently, recognition of the need for such ethical oversight has grown, mainly because of raised awareness of the potential and pervasiveness of big data, data science, and artificial intelligence. Attention has shifted, from rather specialised concerns for informed consent in clinical trials, the preservation of anonymity in survey work, avoiding prohibited variables in insurance decisions, and so on, to much more “in-your-face” issues. These are matters such as selection bias leading to racist decisions, chatbots being gratuitously offensive, and questions of who is responsible when a driverless car crashes or a data theft leads to fraud.

Incidents like these occur for a variety of reasons. Automatic data collection leads to massive data sets accumulating without human oversight. Adaptive and self-learning algorithms go their own way (that’s the whole point, really). And the line between research and practice is becoming blurred in many contexts. Moreover, there is increasing tension between the data minimisation principle (that only sufficient data should be collected to answer the specific question) and the promise of data mining (that large data sets contain nuggets of great potential interest and value).

Resolutions of such tensions are not easy to arrive at, and solutions are complicated by the nature of public opinion—which is both heterogeneous and volatile. Different sections of the public, having had different experiences and been exposed to different circumstances, will have different views on what is right, legitimate, and proper. Worse still, those views will fluctuate with time—perhaps especially in response to events such as media reports of data losses or thefts, or fraud associated with advanced use of data.

Although sometimes described as the new oil, because of the way data, and data science, are revolutionising society just as fossil fuels did earlier, data have unique properties, leading to correspondingly unique ethical challenges. These properties will be very familiar to statisticians: data can be copied (as many times as you like), data can be sold or given away and yet simultaneously retained, data can be used multiple times for many different purposes, data can be of insufficient quality for some uses and yet perfectly adequate for other uses, and so on.

Such diverse applications and properties of data are compounded when data sets are linked, perhaps in unforeseen and indeed unforeseeable ways. A data set might even be linked to new data which did not exist at the time the first data set was collected. There are already plenty of examples where privacy has been breached through sophisticated linking exercises.
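To make the linkage risk concrete, here is a minimal sketch of how an "anonymised" data set might be re-identified by joining it to a second, identified data set on shared quasi-identifiers. Everything in it is an assumption made for illustration: the records are invented, and the field names, the choice of postcode, birth year and sex as linking keys, and the helper `link_records` are all hypothetical. Real linkage exercises are typically probabilistic and far more sophisticated.

```python
# Hypothetical toy data: every record below is invented for illustration.
# "Anonymised" health records: names removed, quasi-identifiers retained.
health_records = [
    {"postcode": "AB1 2CD", "birth_year": 1961, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "AB1 2CD", "birth_year": 1984, "sex": "M", "diagnosis": "diabetes"},
    {"postcode": "EF3 4GH", "birth_year": 1984, "sex": "M", "diagnosis": "migraine"},
    {"postcode": "EF3 4GH", "birth_year": 1984, "sex": "M", "diagnosis": "eczema"},
]

# A separate, identified data set (for example, a public register).
public_register = [
    {"name": "A. Example", "postcode": "AB1 2CD", "birth_year": 1961, "sex": "F"},
    {"name": "B. Sample", "postcode": "AB1 2CD", "birth_year": 1984, "sex": "M"},
    {"name": "C. Placeholder", "postcode": "EF3 4GH", "birth_year": 1984, "sex": "M"},
]

# Assumed linking keys; neither data set contains a shared ID column.
QUASI_IDENTIFIERS = ("postcode", "birth_year", "sex")


def link_records(anonymised, identified, keys=QUASI_IDENTIFIERS):
    """Map each named person to an anonymised record when the match is unique."""
    # Index the anonymised records by their quasi-identifier combination.
    index = {}
    for record in anonymised:
        index.setdefault(tuple(record[k] for k in keys), []).append(record)

    matches = {}
    for person in identified:
        candidates = index.get(tuple(person[k] for k in keys), [])
        # Only an unambiguous match re-identifies someone.
        if len(candidates) == 1:
            matches[person["name"]] = candidates[0]
    return matches


reidentified = link_records(health_records, public_register)
for name, record in reidentified.items():
    print(name, "->", record["diagnosis"])
```

In this toy example two of the three named people are matched to a diagnosis, even though the health records carry no names. The third escapes only because two records share the same combination of attributes, which is the intuition behind protections such as k-anonymity.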

Ethical considerations cover the concept of personal data (this lies at the core of the EU’s General Data Protection Regulation), data ownership (is this a meaningful concept? Some regard data they have collected, possibly at great expense, as theirs, while others regard such data as belonging to the person they describe), consent and purpose, privacy and confidentiality, the right to be forgotten, the right to access data, an awareness of new developments in data science technology, the views of the public, and trustworthiness.

Such considerations do not permit simple formulaic answers, since these must be context-dependent and dynamic. Instead, solutions must be principles-based, with higher-level considerations guiding decisions in any particular context. These principles include that the data and their analysis should serve the public good, should be transparent, must be non-discriminatory, should be trustworthy and honest, should protect individual identities, and should adhere to legal requirements. Moreover, the world of data and data science is changing rapidly, as large data sets continue to accumulate, as new analytic tools continue to be developed, and as real-time and online processing becomes increasingly prevalent (for example, with the advent of the Internet of Things). This means that the principles must be regularly reviewed to see that they remain adequate.

In seeking to apply ethical principles, a delicate balance must often be struck. Constraints on data science must not be so great that they stifle innovation and social progress, preventing statistics and data science from benefiting humanity. That would be just as unethical.

Further reading

European Data Protection Supervisor (2015) Towards a New Digital Ethics: Data, Dignity, and Technology, https://edps.europa.eu/sites/edp/files/publication/15-09-11_data_ethics_en.pdf

Philosophical Transactions of the Royal Society, Volume 374, Issue 2083, theme issue on The Ethical Impact of Data Science.

Hand D.J. (2018) Aspects of data ethics in a changing world: where are we now? Big Data, 6, 176–190.

Metcalf J., Keller E.F., and Boyd D. (2016) Perspectives on Big Data, Ethics, and Society. The Council for Big Data, Ethics, and Society.

Zwitter A.Z. (2014) Big data ethics. Big Data and Society, July-December, 1–6.