Dimitris Politis is a professor in the Department of Mathematics at the University of California, San Diego. He is one of the IMS Bulletin’s Contributing Editors, and a former Editor (January 2011–December 2013). Here, he writes about his most recent pastime, Model-Free Prediction:

1. Estimation

Parametric models served as the cornerstone on which R.A. Fisher, K. Pearson, J. Neyman, E.S. Pearson, W.S. Gosset (also known as “Student”), and others founded Statistical Science at the beginning of the 20th century; their seminal developments resulted in a complete theory of statistics that could be practically implemented using the technology of the time, i.e., pen and paper (and slide-rule!). While some models are inescapable, e.g. modeling a polling dataset as a sequence of independent Bernoulli random variables, others appear contrived, often invoked for the sole purpose of making the mathematics work. As a prime example, the ubiquitous—and typically unjustified—assumption of Gaussian data permeates statistics textbooks to this day. Model criticism and diagnostics were subsequently developed as a practical way out.

With the advent of widely accessible powerful computing in the late 1970s, computer-intensive methods such as resampling and cross-validation created a revolution in modern statistics. Using computers, statisticians were able to analyze large datasets for the first time, paving the way towards the ‘big data’ era of the 21st century. But perhaps more important was the realization that the way we do the analysis could/should be changed as well, as practitioners were gradually freed from the limitations of parametric models. For instance, the great success of Efron’s (1979) bootstrap was in providing a complete theory for statistical inference in a nonparametric setting, much as Maximum Likelihood Estimation had done half a century earlier under the restrictive parametric setup.

Nevertheless, there is a further step one may take, i.e., going beyond even nonparametric models. To explain this, let us first focus on regression, i.e., data that are pairs: (Y1,X1), (Y2,X2), … , (Yn,Xn) where Yi is the measured response associated with a regressor value of Xi. The standard homoscedastic additive model in this situation reads:

$Y_i = \mu(X_i) + \epsilon_i$                                                        (1)

where the random variables ϵi are assumed to be independent, identically distributed (i.i.d.) from a distribution F(·) with mean zero.
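For concreteness, a minimal simulation of model (1) might look as follows; the particular trend function μ(·), the Gaussian error distribution F(·), and the sample size are purely illustrative choices, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
X = rng.uniform(0, 10, size=n)        # regressor values X_1, ..., X_n

def mu(x):
    """A hypothetical trend function mu(.), chosen purely for illustration."""
    return 2.0 + np.sin(x)

eps = rng.normal(0.0, 0.5, size=n)    # i.i.d. errors from F(.) with mean zero
Y = mu(X) + eps                       # responses generated according to eq. (1)
```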

• Parametric model: Both μ(·) and F(·) belong to parametric families of functions, i.e., a setup where the only unknown is a finite-dimensional parameter; a typical example is straight-line regression with Gaussian errors, i.e., μ(x) = β0 + β1x and F(·) being N(0, σ²).

• Semiparametric model: μ(·) belongs to a parametric family, whereas F(·) does not; instead, it may be assumed that F(·) belongs to a smoothness class, e.g., assume that F(·) is absolutely continuous.

• Nonparametric model: Neither μ(·) nor F(·) can be assumed to belong to parametric families of functions.

Despite its nonparametric aspect, even the last option constitutes a model, and can thus be rather restrictive. To see why, note that eq. (1) with i.i.d. errors is not satisfied in many cases of interest, even after allowing for heteroscedasticity of the errors. Nevertheless, it is possible to shun eq. (1) altogether and instead adopt a model-free setup that can be described as follows.

Model-Free Regression:

– Random design. The pairs (Y1,X1), (Y2,X2), … , (Yn,Xn) are i.i.d.

– Deterministic design. The variables X1, … , Xn are deterministic, and the random variables Y1, … , Yn are independent with common conditional distribution, i.e., P{Yj ≤ y | Xj = x} = Dx(y) not depending on j.

Inference for features, i.e. functionals, of the common conditional distribution Dx(·) is still possible under some regularity conditions, e.g. smoothness. Arguably, the most important such feature is the conditional mean E(Y | X = x), which can be denoted μ(x). When μ(x) can be assumed smooth, it can be consistently estimated by a local average and/or a local polynomial estimator. Asymptotic normality and/or resampling can then be invoked to construct confidence intervals for μ(x).
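For instance, a minimal sketch of such a local average is the Nadaraya–Watson kernel estimator below; the Gaussian kernel and the fixed bandwidth h are illustrative choices only:

```python
import numpy as np

def nw_estimate(x0, X, Y, h=0.5):
    """Local (kernel-weighted) average estimate of mu(x0) = E(Y | X = x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights centered at x0
    return np.sum(w * Y) / np.sum(w)

# e.g., nw_estimate(5.0, X, Y) estimates mu(5) from the simulated (X, Y) above
```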

2. Prediction

Traditionally, the problem of prediction has been approached in a model-based way, i.e., (a) fit a model such as (1), and then (b) use the fitted model for prediction of a future response Yf associated with a regressor value xf. Note that even in the absence of model (1), the conditional expectation μ(xf) = E(Yf | Xf = xf) is the Mean Squared Error (MSE) optimal predictor of Yf; a one-line justification is given below. As already mentioned, μ(xf) can be estimated in a Model-Free way and then used for predicting Yf, but a problem remains: how to gauge the accuracy of prediction, i.e., how to construct a prediction—as opposed to confidence—interval.
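Indeed, for any candidate predictor g(xf) of Yf,

$E\{ (Y_f - g(x_f))^2 \, | \, X_f = x_f \} = E\{ (Y_f - \mu(x_f))^2 \, | \, X_f = x_f \} + ( \mu(x_f) - g(x_f) )^2 ,$

because the cross term has zero conditional expectation; the right-hand side is therefore minimized by choosing g(xf) = μ(xf).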

Interestingly, it is possible to accomplish the goal of point and interval prediction of Yf under the Model-Free regression setup in a direct fashion, i.e., without the intermediate step of model-fitting; this is achieved via the Model-Free Prediction Principle expounded upon in Politis (2015). Model-Free Prediction restores the emphasis on observable quantities, i.e., current and future data, as opposed to unobservable model parameters and estimates thereof. In this sense, the Model-Free Prediction Principle is concordant with Bruno de Finetti’s statistical philosophy. Notably, being able to predict the response Yf associated with the regressor Xf taking on any possible value (say xf) seems to also achieve, almost inadvertently, the main goal of modeling, i.e., describing how Y depends on X. In so doing, the solution to interesting estimation problems is obtained as a by-product, e.g. inference on features of Dx(·) such as its mean μ(x). In other words, just as prediction can be treated as a by-product of model-fitting, key estimation problems can be solved as a by-product of the ability to perform prediction. Hence, a Model-Free approach to frequentist statistical inference is possible, including both prediction and confidence intervals.

3. The Model-Free Prediction Principle

Consider the Model-Free regression setup with a vector of observed responses Yn = (Y1, … , Yn) that are associated with the vector of regressors Xn = (X1, … , Xn). Also consider the enlarged vectors
Yn+1 = (Y1, … , Yn, Yn+1) and Xn+1 = (X1, … , Xn, Xn+1) where (Yn+1, Xn+1) is an alternative notation for (Yf, Xf); recall that Yf is as yet unobserved, and Xf will be set equal to the value xf of interest. If the Yi’s were i.i.d. (and not depending on their associated X value), then prediction would be trivial: the MSE-optimal predictor of Yn+1 would simply be the common expected value of the Yi’s, completely disregarding the value of Xn+1.

In a nutshell, the Model-Free Prediction Principle amounts to using the structure of the problem in order to find an invertible transformation Hm that can map the non-i.i.d. vector Ym to a vector ϵm = (ϵ1, … , ϵm) that has i.i.d. components; here m could be taken equal to either n or n+1 as needed. Letting Hm−1 denote the inverse transformation, we have ϵm = Hm(Ym) and Ym = Hm−1(ϵm), i.e.,

$\underline{Y}_m \stackrel{H_m}{\longmapsto} \underline{\epsilon}_m \ \ \mbox{and} \ \ \underline{\epsilon}_m \stackrel{H_m^{-1}}{\longmapsto} \underline{Y}_m .$                                                        (2)

If the practitioner is successful in implementing the Model-Free procedure, i.e., in identifying (and estimating) the transformation Hm to be used, then the prediction problem is reduced to the trivial one of predicting i.i.d. variables. To see why, note that eq. (2) with m = n+1 yields Yn+1 = Hn+1−1(ϵn+1) = Hn+1−1(ϵn, ϵn+1). But ϵn can be treated as known (and constant) given the data Yn; just use eq. (2) with m = n. Since the unobserved scalar Yn+1 is just the (n+1)th coordinate of the vector Yn+1, we have thus expressed Yn+1 as a function of the known ϵn and the single unobserved ϵn+1. Note that predicting a function, say g(·), of an i.i.d. sequence ϵ1, … , ϵn, ϵn+1 is straightforward because g(ϵ1), … , g(ϵn), g(ϵn+1) is simply another sequence of i.i.d. random variables. Hence, the practitioner can use this simple structure to develop point predictors for the future response Yn+1.
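To fix ideas, here is a minimal sketch in which the transformation is taken to be the simple residual map ϵi = Yi − μ(Xi), with μ(·) replaced by a kernel estimate such as the one sketched earlier; this particular choice of Hm is just one convenient possibility under an additive structure (Politis (2015) constructs more flexible transformations), and `mu_hat` below stands for any such estimate of μ(·):

```python
import numpy as np

def point_predict(X, Y, x_f, mu_hat):
    """Point prediction of Y_{n+1} at regressor value x_f, taking the residual map
    eps_i = Y_i - mu_hat(X_i) as a stand-in for the transformation H_n."""
    eps = Y - np.array([mu_hat(x) for x in X])   # forward step: eps_n is known given the data
    # Predicting the future i.i.d. quantity eps_{n+1} by its sample mean is trivial;
    # mapping back through the inverse transformation yields the predictor of Y_{n+1}.
    return mu_hat(x_f) + np.mean(eps)

# e.g., point_predict(X, Y, 5.0, lambda x: nw_estimate(x, X, Y))
```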

Prediction intervals can then be immediately constructed by resampling the i.i.d. variables ϵ1, … , ϵn; this can be thought of as an extension of the model-based, residual bootstrap of Efron (1979) to Model-Free settings since, if model (1) were to hold true, the residuals from the model could be considered the outcomes of the requisite transformation Hn.
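Continuing the residual-map sketch above, a rough version of such a resampling-based prediction interval is given below; for simplicity, the estimation variability of `mu_hat` is ignored here (the full construction in Politis (2015) accounts for such effects, e.g. via predictive residuals):

```python
import numpy as np

def prediction_interval(X, Y, x_f, mu_hat, alpha=0.05, B=1000, seed=0):
    """Resampling-based (1 - alpha) prediction interval for Y_{n+1} at x_f,
    obtained by resampling the transformed quantities eps_1, ..., eps_n."""
    rng = np.random.default_rng(seed)
    eps = Y - np.array([mu_hat(x) for x in X])   # approximately i.i.d. after the transformation
    eps = eps - eps.mean()                       # center, so the resampled errors have mean zero
    # Draw B future values of eps_{n+1} and map them back through the inverse transformation:
    future = mu_hat(x_f) + rng.choice(eps, size=B, replace=True)
    return np.quantile(future, [alpha / 2, 1 - alpha / 2])
```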

4. Time series

Under regularity conditions, a transformation such as Hm of the Model-Free Prediction Principle always exists but is not necessarily unique. For example, if the variables (Y1, … , Ym) have an absolutely continuous joint distribution and no explanatory variables Xm are available, then the Rosenblatt (1952) transformation can map them onto a set of i.i.d. random variables. Nevertheless, estimating the Rosenblatt transformation from data may be infeasible except in special cases. On the other hand, a practitioner may exploit a given structure for the data at hand, e.g., a regression structure, in order to construct a different, case-specific transformation that may be practically estimable from the data.

Recall that the Rosenblatt transformation maps an arbitrary random vector Ym = (Y1, … , Ym) having absolutely continuous joint distribution onto a random vector Um = (U1, … , Um) whose entries are i.i.d. Uniform(0,1); this is done via the probability integral transform based on conditional distributions. For k > 1, define the conditional distributions Fk(yk | yk−1, … , y1) = P{Yk ≤ yk | Yk−1 = yk−1, … , Y1 = y1}, and let F1(y1) = P{Y1 ≤ y1}. Then the Rosenblatt transformation amounts to letting U1 = F1(Y1), U2 = F2(Y2 | Y1), U3 = F3(Y3 | Y2, Y1), …, and Um = Fm(Ym | Ym−1, … , Y2, Y1).
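As a toy illustration, the sketch below applies the Rosenblatt transformation to a simulated Gaussian AR(1) series; this case is chosen only because its conditional distributions are known in closed form (and, by Markovness, depend only on the previous value, anticipating the discussion below):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate a stationary Gaussian AR(1) series: Y_k = phi * Y_{k-1} + Z_k, Z_k ~ N(0,1) i.i.d.
phi, n = 0.6, 500
Y = np.empty(n)
Y[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi ** 2))   # draw from the stationary distribution
for k in range(1, n):
    Y[k] = phi * Y[k - 1] + rng.normal()

# Rosenblatt transformation using the (here known) conditional distributions:
U = np.empty(n)
U[0] = norm.cdf(Y[0], scale=1.0 / np.sqrt(1.0 - phi ** 2))   # U_1 = F_1(Y_1)
U[1:] = norm.cdf(Y[1:] - phi * Y[:-1])                       # U_k = F_k(Y_k | past) = F_2(Y_k | Y_{k-1})
# The entries of U are i.i.d. Uniform(0,1).
```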

The problem is that the distributions Fk for k ≥ 1 are typically unknown and must be estimated (in a continuous fashion) from the Yn data at hand. However, unless there is some additional structure, this estimation task may be unreliable or plainly infeasible for large k. As an extreme example, note that to estimate Fn we would have only one point (in n-dimensional space) to work with. Hence, without additional assumptions, the estimate of Fn would be a point mass, which is completely unreliable and, due to its discontinuity, of little use for constructing a probability integral transform.

An example of additional structure is the Markov setup. To elaborate, suppose that the data Y1, … , Yn are a realization of a stationary (and ergodic) Markov chain. In this case, the conditional distributions Fk for all k > 1 are completely determined by the one-step transition distribution, namely F2. To see why, note that the Markov assumption implies that P{Yk ≤ yk | Yk−1 = yk−1, … , Y1 = y1} = P{Yk ≤ yk | Yk−1 = yk−1} for k > 1. Hence, the practitioner may use kernel smoothing or a related technique on the data pairs {(Yj, Yj+1) for j = 1, … , n−1} in order to estimate the common joint distribution of these pairs. In turn, this yields estimates of F1 and F2, and by extension Fk for k > 2, so that the Rosenblatt transformation can be practically implemented as part of the Model-Free Prediction Principle.
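A minimal sketch of this smoothing step is given below; the Gaussian kernel and the fixed bandwidth are illustrative choices, and in a careful implementation the indicator in the last line would be replaced by a smooth kernel CDF so that the estimate is continuous in y, in line with the continuity requirement noted earlier:

```python
import numpy as np

def transition_cdf_estimate(Y, y, x, h=0.3):
    """Kernel-smoothed estimate of F_2(y | x) = P{Y_{j+1} <= y | Y_j = x},
    computed from the data pairs (Y_j, Y_{j+1}), j = 1, ..., n-1."""
    prev, nxt = Y[:-1], Y[1:]                   # the data pairs
    w = np.exp(-0.5 * ((prev - x) / h) ** 2)    # Gaussian kernel weights in the conditioning variable
    return np.sum(w * (nxt <= y)) / np.sum(w)   # kernel-weighted empirical conditional CDF

# Estimated Rosenblatt variables for k > 1:
#   U_k = transition_cdf_estimate(Y, Y[k], Y[k-1])
```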

Further examples of transformations applicable to diverse settings with regression and/or time series data are discussed in Politis (2015).

References

[1] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist., vol. 7, pp. 1–26.

[2] Politis, D.N. (2015). Model-Free Prediction and Regression: A Transformation-Based Approach to Inference, Springer, New York.

[3] Rosenblatt, M. (1952). Remarks on a multivariate transformation. Ann. Math. Statist., vol. 23, pp. 470–472.