“Ranking Our Excellence,” or “Assessing Our Quality,” or Whatever…

One of the final responsibilities of each IMS President is to deliver a Presidential Address at the IMS annual meeting. Peter Hall gave his talk on August 1 at JSM in Miami. The text is reproduced in full here.

We live in an era where almost everything apparently can be quantified, and most things are. Massive quantities of data are generated every day on subjects ranging from our supermarket purchases to changes in the climate. This creates unprecedented opportunities for statisticians, but also many challenges. Not least are the challenges of quantifying our own professional lives. The managers of the institutions, organisations and companies where we work measure our performance, and the quality of our work, using rankings, bibliometric analyses and similar approaches.

Tonight I want to say a few words about these matters. They are fundamentally statistical in many ways, yet no-one, least of all us or our managers, has many of the tools necessary to undertake the analysis properly and convey the results. In many cases we do not even seem to have the knowledge needed to develop the tools.

Rankings, for example of the institutions or departments where we work, are among the most common of the techniques that are used to analyse us, and perhaps to divide and conquer us. Goldstein and Spiegelhalter, in a paper in the Journal of the Royal Statistical Society (Series A) in 1996, argued that no ranking should be unaccompanied by a measure of its authority. This recommendation is seldom honoured. Rankings should also be reasonably transparent, so that the implicit statements that they make about us are clear to non-experts (in some cases that includes us!).

International rankings of universities are a case in point. The data on which they rely are seldom accessible, and the methodology they use is typically secret. The methodology usually can be inferred only approximately, for example via a process of reverse engineering, noting the changes in rankings after Nobel Prizes, etc., have been awarded. For all these reasons the rankings are far from transparent. And they are seldom accompanied by a measure of their reliability. Yet they influence significantly the comments, and subsequently the actions, of university managers and, sometimes, the decisions of governments.

The US National Research Council ranking of university statistics departments, compiled using data on graduate programs and released almost a year ago, is to be applauded for incorporating statistical measures of authority, based on resampling ideas, into its analysis. An extensive manual describes the methodology and metrics. Admirably, last November the ASA organised a NISS workshop on assessing the quality of graduate programs. However, the ranking of graduate programs does not meet the criterion of transparency. As a result, the ranking sometimes has been misinterpreted by university managers. More generally, statistical issues relating to modelling rankings, describing their authority and simplifying their interpretation, deserve greater attention from us than we have given them.
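
To make the idea of a ranking’s “authority” concrete, the following sketch (with invented departments and scores; it is not the NRC’s actual procedure) uses bootstrap resampling to attach an uncertainty interval to each position in a ranking: every department’s scores are resampled, the ranking is recomputed on each resample, and the spread of the resulting ranks is reported alongside the point estimate.

```python
# A minimal sketch, assuming invented departments and scores (this is not the
# NRC's actual procedure): bootstrap resampling attaches an uncertainty
# interval to each position in a ranking, in the spirit of Goldstein and
# Spiegelhalter's recommendation.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-faculty quality scores for five departments.
departments = {
    "A": rng.normal(3.8, 0.6, size=25),
    "B": rng.normal(3.6, 0.8, size=18),
    "C": rng.normal(3.5, 0.5, size=30),
    "D": rng.normal(3.4, 0.9, size=12),
    "E": rng.normal(3.2, 0.7, size=22),
}
names = list(departments)

def ranks_from_means(means):
    """Rank 1 = highest mean score."""
    order = np.argsort(-np.asarray(means))
    ranks = np.empty(len(means), dtype=int)
    ranks[order] = np.arange(1, len(means) + 1)
    return ranks

# Point estimate of the ranking.
point_ranks = ranks_from_means([departments[n].mean() for n in names])

# Bootstrap: resample each department's scores and recompute the ranking.
B = 2000
boot_ranks = np.empty((B, len(names)), dtype=int)
for b in range(B):
    boot_means = [rng.choice(departments[n], size=len(departments[n])).mean()
                  for n in names]
    boot_ranks[b] = ranks_from_means(boot_means)

# A 90% interval for each department's rank accompanies the point estimate.
lo, hi = np.percentile(boot_ranks, [5, 95], axis=0)
for i, n in enumerate(names):
    print(f"Dept {n}: rank {point_ranks[i]} (90% interval {int(lo[i])}-{int(hi[i])})")
```

Positions whose intervals overlap heavily are, for practical purposes, statistically indistinguishable, which is precisely the information a bare ranked list suppresses.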

Rankings, like bar graphs and pie charts, are only a way of presenting data. The data themselves should also be a focus of our attention. The IMS as a society, and also through its individual members, has played an important role in the study of performance-related data. In particular, the IMS was one of three professional bodies (the others were the International Mathematical Union and the International Council for Industrial and Applied Mathematics) which produced the report Citation Statistics, addressing that most pernicious of academic topics, bibliometric data and their analysis. The report’s authors, Robert Adler, John Ewing and Peter Taylor, expressed concern about the use of citation data for assessing research performance. Their report, published in Statistical Science in 2009, noted that:

There is a belief that citation statistics are inherently more accurate because they substitute simple numbers for complex judgments, and hence overcome the possible subjectivity of peer review.

Adler, Ewing and Taylor pointed to the fallacy of such beliefs, and drew conclusions that, although concerning to some scientists (and to university and other institutional managers), resonate with many of us:

• The accuracy of citation metrics (e.g., raw citation counts, impact factors, h-factors and so forth) is illusory. Moreover, the misuse of citation statistics is widespread and egregious. In spite of repeated attempts to warn against such misuse (e.g., the misuse of the impact factor), governments, institutions, and even scientists themselves continue to draw unwarranted or even false conclusions from the misapplication of citation statistics.

• Sole reliance on citation-based metrics replaces one kind of judgment with another: Instead of subjective peer review one has the subjective interpretation of a citation’s meaning.

• While statistics are valuable for understanding the world in which we live, they provide only a partial understanding. Those who promote the use of citation statistics as a replacement for a fuller understanding of research implicitly hold such a belief. We not only need to use statistics correctly—we need to use them wisely as well.

Bibliometric analyses, and other statistical measures, are sometimes used in connection with appointment and promotion cases, occasionally with dramatic consequences (as we’ll see below, when discussing the Australian experience). Indeed, potential applications to appointment and promotion decisions are a major reason for heightened interest in the interpretation and application of bibliometric analyses. However, many other empirical approaches have been employed. They range from the Texas A&M University system’s development of a methodology that evaluates how much university professors “are worth,” based on their salaries, how much research money they bring in, and how much money they generate from teaching; to the much more widespread use of student evaluations of classroom teaching. All have weaknesses, and some have strengths as well.

Several nations have attempted, or are currently attempting, to unlock the secrets of bibliometric data so that they can use them to their advantage. For example, the first Italian national research evaluation, which commenced in 2003, sought to answer the following questions, among others:

(i) Are peer review judgements and bibliometric indicators independent variables, and if not, what is the strength of association between them?

(ii) Is the association between peer judgement and article citation rating significantly stronger than the association between peer judgement and journal citation rating?

The Italians concluded that:

(i) Bibliometrics are not independent of peer review assessment; but while the correlation between peer assessment and bibliometric indicators is statistically significant, it is not perfect.

(ii) “Bibliometric indicators may be considered as approximation measures of the inherent quality of papers, which, however, remains fully assessable only with aid of human unbiased judgement, meditation, and elaboration. We advocate the integration of peer review with bibliometric indicators, in particular those directly related to the impact of individual articles, during the next national assessment exercises.”
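
As a purely illustrative aside (simulated data; not the methodology the Italian exercise actually used), a rank correlation is one simple way to quantify the association asked about in question (i) above:

```python
# A toy illustration, with simulated data, of question (i): quantifying the
# association between peer-review judgements and a citation-based indicator.
# Spearman's rank correlation is used here as one natural choice; the Italian
# exercise's actual methodology may differ.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_papers = 200

# A hypothetical latent "quality" drives both peer scores and citations, noisily.
quality = rng.normal(size=n_papers)
peer_score = np.clip(np.round(3 + quality + rng.normal(0, 0.8, n_papers)), 1, 5)
citations = rng.poisson(np.exp(1.5 + 0.6 * quality))

rho, p_value = spearmanr(peer_score, citations)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
# Typically the correlation is clearly positive and statistically significant,
# yet well below 1: "not independent, but not perfect".
```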

Australian authorities, too, have been endeavouring to turn bibliometric data to good use, so as to distribute block grant funding to universities. (This is the type of funding that, in the US, is passed on largely in the form of grant overheads. In this sense, grant income in the US is used as a proxy for research performance.) Interestingly, in 2011 a trial run of Australia’s research assessment exercise led the federal government to conclude that a rather controversial ranked list of journals, to which the government had seemed to be committed, was not providing the overall benefits that had been anticipated. Announcing the virtual abandonment of journal rankings three months ago, the Australian Minister for Innovation, Industry, Science and Research noted that the rankings were being seriously misused by university managers:

There is clear and consistent evidence that the rankings were being deployed inappropriately within some quarters of the [university] sector, in ways that could produce harmful outcomes, and [were] based on a poor understanding of the actual role of the rankings. One common example was the setting of targets [in connection with appointments or promotions] for publication in [highly ranked] journals by institutional research managers.

However, there is concern that, in a range of research fields in Australia, rankings based on citation analyses, rather than peer review, will fill any gaps left by reduced reliance on journal rankings. Moreover, journal rankings will apparently still be used to some extent in the Australian research assessment process.

The Australian government’s enthusiasm for a research assessment exercise that emphasises bibliometric analysis and, in many fields of research, seems to give peer review a significantly lesser role, would appear to be at odds with the Italian experience. It is also in conflict with the UK’s proposed new methodology for assessing research importance, which, as we shall now relate, originally seemed to favour a largely bibliometric process but then retreated from it. Just as importantly, the Australian government’s decision to largely abandon journal rankings reflects an issue that has to be borne in mind whenever funding is distributed as the outcome of a performance evaluation: as is no doubt hoped, the people receiving that funding respond to their evaluation by changing their behaviour so as to secure more funding next time, but they may change in ways that are counterproductive to the goals the process was originally designed to achieve.

The UK once argued rather forthrightly that its Research Assessment Exercise (RAE) should not explicitly take account of citation data and the like. For example, the instructions to 2008 RAE panels included the following injunction:

In assessing excellence, the sub-panel will look for originality, innovation, significance, depth, rigour, influence on the discipline and wider fields and, where appropriate, relevance to users. In assessing publications the sub-panel will use the criteria in normal use for acceptance by internationally recognised journals. The sub-panel will not use a rigid or formulaic method of assessing research quality. It will not use a formal ranked list of outlets, nor impact factors, nor will it use citation indices in a formulaic way.

However, the replacement for the RAE was originally intended to use metrics in place of peer review:

It is the Government’s intention that the current method for determining the quality of university research—the UK Research Assessment Exercise (RAE)—should be replaced after the next cycle is completed in 2008. Metrics, rather than peer-review, will be the focus of the new system and it is expected that bibliometrics (using counts of journal articles and their citations) will be a central quality index in this system.
[Evidence Report, Research Policy Committee of Universities, p.3]

The Times Higher Education Supplement for 9 November 2007 headlined a story on its front page with the words, “New RAE based on citations,” and commented thus:

… After next year’s RAE, funding chiefs will measure the number of citations for each published paper in large science subjects as part of the new system to determine the allocation of more than £1 billion a year in research funding. A report published by Universities UK endorses such a “citations per paper” system as the only sensible option among a number of so-called bibliometric quality measurements. It concludes that measuring citations can accurately indicate research quality.

The UK apparently has withdrawn from this position. The Research Excellence Framework (REF), which will replace the RAE and be run for the first time in 2013, will instead assess research in at least three ways and incorporate peer review:

• Outputs (total weight 65%): The primary focus of the REF exercise will be to identify excellent research, apparently using largely expert peer review but, in subjects where robust data are available, “peer review may be informed by additional citation information.”

• Impact (20%): The REF will endeavour to identify cases where “researchers build on excellent research to deliver demonstrable benefits to society, public policy, culture, quality of life and the economy.”

• Environment (15%): The REF will assess, and take into account, the quality of the research environment.

In the UK, particularly in the mathematical sciences including statistics, the second of these criteria has become the most controversial, not least because of its potential to focus research on short-term goals. It is admittedly less of a problem for statisticians than for, say, pure mathematicians.

The US is not as vulnerable to these difficulties as many other nations—such as the UK, European countries including France, Italy and The Netherlands (but not Germany), and Australia—that have relatively unified national higher education systems. Admittedly, substantial research funding is provided federally in the US, but few US states would allow their university systems to be scrutinised federally in the manner that is accepted practice in other countries. In a state or provincial system a state can experiment with different ways of measuring and rewarding research performance, and the others can sit on the sidelines and watch the experiment, adopting the methodology only if it is effective.

As statisticians we should become more involved in these matters than we are. We are often the subject of the analyses discussed above, and almost alone we have the skills to respond to them, for example by developing new methodologies or by pointing out the limitations of existing approaches. To illustrate the fact that issues obvious to statisticians are often ignored in bibliometric analysis, I mention that many proponents of impact factors, and of other aspects of citation analysis, have little concept of the problems caused by averaging very heavy-tailed data. (Citation data are typically of this type.) We should definitely take a greater interest in this area.
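
To illustrate the heavy-tail point concretely, here is a small simulation with synthetic, Pareto-type “citation counts” (the parameters are arbitrary, not fitted to real data): impact-factor-style means computed from equally sized heavy-tailed samples fluctuate wildly, while medians remain stable.

```python
# A small simulation, purely illustrative, of why averaging very heavy-tailed
# data is hazardous. The Pareto-type tail index and scale below are arbitrary
# choices, not fitted to real citation data.
import numpy as np

rng = np.random.default_rng(2)
tail_index = 1.5            # below 2, so the variance is infinite
n_journals, n_papers = 20, 200

means, medians = [], []
for _ in range(n_journals):
    # Synthetic "citation counts" for one journal's papers.
    cites = np.floor(rng.pareto(tail_index, size=n_papers) * 5).astype(int)
    means.append(cites.mean())
    medians.append(np.median(cites))

print(f"impact-factor-style means:  min {min(means):.1f}, max {max(means):.1f}")
print(f"medians:                    min {min(medians):.1f}, max {max(medians):.1f}")
# The means swing wildly from sample to sample, driven by a few extreme
# values; the medians barely move.
```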

Acknowledgement: I am grateful to Rudy Beran, Valerie Isham and Wolfgang Polonik for helpful comments.