Dimitris Politis is a professor in the Department of Mathematics at the University of California, San Diego. He is one of the IMS Bulletin’s Contributing Editors, and a former Editor (January 2011–December 2013). Here, he writes about his most recent pastime, Model-Free Prediction:

1. Estimation

Parametric models served as the cornerstone on which R.A. Fisher, K. Pearson, J. Neyman, E.S. Pearson, W.S. Gosset (also known as “Student”), and others founded Statistical Science at the beginning of the 20th century; their seminal developments resulted in a complete theory of statistics that could be practically implemented using the technology of the time, i.e., pen and paper (and slide-rule!). While some models are inescapable, e.g. modeling a polling dataset as a sequence of independent Bernoulli random variables, others appear contrived, often invoked for the sole purpose of making the mathematics work. As a prime example, the ubiquitous—and typically unjustified—assumption of Gaussian data permeates statistics textbooks to this day. Model criticism and diagnostics were subsequently developed as a practical way out.

With the advent of widely accessible powerful computing in the late 1970s, computer-intensive methods such as resampling and cross-validation created a revolution in modern statistics. Using computers, statisticians were able to analyze large datasets for the first time, paving the way towards the ‘big data’ era of the 21st century. But perhaps more important was the realization that the way we do the analysis could/should be changed as well, as practitioners were gradually freed from the limitations of parametric models. For instance, the great success of Efron’s (1979) bootstrap was in providing a complete theory for statistical inference in a nonparametric setting, much as Maximum Likelihood Estimation had done half a century earlier under the restrictive parametric setup.

Nevertheless, there is a further step one may take, i.e., going beyond even nonparametric models. To explain this, let us first focus on regression, i.e., data that are pairs: (Y1,X1), (Y2,X2), … , (Yn,Xn) where Yi is the measured response associated with a regressor value of Xi. The standard homoscedastic additive model in this situation reads:

$Y_i = \mu(X_i) + \epsilon_i$                                                        (1)

where the random variables ϵi are assumed to be independent, identically distributed (i.i.d.) from a distribution F(·) with mean zero.
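For concreteness, a minimal simulation of model (1) might look as follows; the particular trend function μ(·), the Gaussian error distribution F(·), and the sample size are purely illustrative choices, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
X = rng.uniform(0, 10, size=n)        # regressor values X_1, ..., X_n

def mu(x):
    """A hypothetical trend function mu(.), chosen purely for illustration."""
    return 2.0 + np.sin(x)

eps = rng.normal(0.0, 0.5, size=n)    # i.i.d. errors from F(.) with mean zero
Y = mu(X) + eps                       # responses generated according to eq. (1)
```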

• Parametric model: Both μ(·) and F(·) belong to parametric families of functions, i.e., a setup where the only unknown is a finite-dimensional parameter; a typical example is straight-line regression with Gaussian errors, i.e., μ(x) = β0 + β1x and F(·) being N(0, σ²).

• Semiparametric model: μ(·) belongs to a parametric family, whereas F(·) does not; instead, it may be assumed that F(·) belongs to a smoothness class, e.g., assume that F(·) is absolutely continuous.

• Nonparametric model: Neither μ(·) nor F(·) can be assumed to belong to parametric families of functions.

Despite its nonparametric aspect, even the last option constitutes a model, and can thus be rather restrictive. To see why, note that eq. (1) with i.i.d. errors is not satisfied in many cases of interest, even after allowing for heteroscedasticity of the errors. Nevertheless, it is possible to shun eq. (1) altogether and instead adopt a model-free setup that can be described as follows.

Model-Free Regression:

– Random design. The pairs (Y1,X1), (Y2,X2), … , (Yn,Xn) are i.i.d.

– Deterministic design. The variables X1, … , Xn are deterministic, and the random variables Y1, … , Yn are independent with common conditional distribution, i.e., P{Yj ≤ y | Xj = x} = Dx(y) not depending on j.

Inference for features, i.e. functionals, of the common conditional distribution Dx(·) is still possible under some regularity conditions, e.g. smoothness. Arguably, the most important such feature is the conditional mean E(Y | X = x), which can be denoted μ(x). When μ(x) can be assumed smooth, it can be consistently estimated by a local average and/or a local polynomial estimator. Asymptotic normality and/or resampling can then be invoked to construct confidence intervals for μ(x).
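For instance, a minimal sketch of such a local average is the Nadaraya–Watson kernel estimator below; the Gaussian kernel and the fixed bandwidth h are illustrative choices only:

```python
import numpy as np

def nw_estimate(x0, X, Y, h=0.5):
    """Local (kernel-weighted) average estimate of mu(x0) = E(Y | X = x0)."""
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)   # Gaussian kernel weights centered at x0
    return np.sum(w * Y) / np.sum(w)

# e.g., nw_estimate(5.0, X, Y) estimates mu(5) from the simulated (X, Y) above
```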

2. Prediction

Traditionally, the problem of prediction has been approached in a model-based way, i.e., (a) fit a model such as (1), and then (b) use the fitted model for prediction of a future response Yf associated with a regressor value xf. Note that even in the absence of model (1), the conditional expectation μ(xf) = E(Yf | Xf = xf) is the Mean Squared Error (MSE) optimal predictor of Yf; a one-line justification is given below. As already mentioned, μ(xf) can be estimated in a Model-Free way and then used for predicting Yf, but a problem remains: how to gauge the accuracy of prediction, i.e., how to construct a prediction—as opposed to confidence—interval.
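Indeed, for any candidate predictor g(xf) of Yf,

$E\{ (Y_f - g(x_f))^2 \, | \, X_f = x_f \} = E\{ (Y_f - \mu(x_f))^2 \, | \, X_f = x_f \} + ( \mu(x_f) - g(x_f) )^2 ,$

because the cross term has zero conditional expectation; the right-hand side is therefore minimized by choosing g(xf) = μ(xf).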

Interestingly, it is possible to accomplish the goal of point and interval prediction of Yf under the Model-Free regression setup in a direct fashion, i.e., without the intermediate step of model-fitting; this is achieved via the Model-Free Prediction Principle expounded upon in Politis (2015). Model-Free Prediction restores the emphasis on observable quantities, i.e., current and future data, as opposed to unobservable model parameters and estimates thereof. In this sense, the Model-Free Prediction Principle is concordant with Bruno de Finetti’s statistical philosophy. Notably, being able to predict the response Yf associated with the regressor Xf taking on any possible value (say xf) seems to also achieve, almost inadvertently, the main goal of modeling, i.e., describing how Y depends on X. In so doing, the solution to interesting estimation problems is obtained as a by-product, e.g. inference on features of Dx(·) such as its mean μ(x). In other words, just as prediction can be treated as a by-product of model-fitting, key estimation problems can be solved as a by-product of the ability to perform prediction. Hence, a Model-Free approach to frequentist statistical inference is possible, including both prediction and confidence intervals.

3. The Model-Free Prediction Principle

Consider the Model-Free regression setup with a vector of observed responses Yn = (Y1, … , Yn) that are associated with the vector of regressors Xn = (X1, … , Xn). Also consider the enlarged vectors
Yn+1 = (Y1, … , Yn, Yn+1) and Xn+1 = (X1, … , Xn, Xn+1) where (Yn+1, Xn+1) is an alternative notation for (Yf, Xf); recall that Yf is as yet unobserved, and Xf will be set equal to the value xf of interest. If the Yi’s were i.i.d. (and not depending on their associated X value), then prediction would be trivial: the MSE-optimal predictor of Yn+1 would simply be the common expected value of the Yi’s, completely disregarding the value of Xn+1.

In a nutshell, the Model-Free Prediction Principle amounts to using the structure of the problem in order to find an invertible transformation Hm that can map the non-i.i.d. vector Ym to a vector ϵm = (ϵ1, … , ϵm) that has i.i.d. components; here m could be taken equal to either n or n+1 as needed. Letting Hm−1 denote the inverse transformation, we have ϵm = Hm(Ym) and Ym = Hm−1(ϵm), i.e.,

$\underline{Y}_m \stackrel{H_m}{\longmapsto} \underline{\epsilon}_m \ \ \mbox{and} \ \ \underline{\epsilon}_m \stackrel{H_m^{-1}}{\longmapsto} \underline{Y}_m .$                                                        (2)

If the practitioner is successful in implementing the Model-Free procedure, i.e., in identifying (and estimating) the transformation Hm to be used, then the prediction problem is reduced to the trivial one of predicting i.i.d. variables. To see why, note that eq. (2) with m = n+1 yields Yn+1 = Hn+1−1(ϵn+1) = Hn+1−1(ϵn, ϵn+1). But ϵn can be treated as known (and constant) given the data Yn; just use eq. (2) with m = n. Since the unobserved scalar Yn+1 is just the (n+1)th coordinate of the vector Yn+1, we have thus expressed Yn+1 as a function of the known ϵn and the single unobserved ϵn+1. Note that predicting a function, say g(·), of an i.i.d. sequence ϵ1, … , ϵn, ϵn+1 is straightforward because g(ϵ1), … , g(ϵn), g(ϵn+1) is simply another sequence of i.i.d. random variables. Hence, the practitioner can use this simple structure to develop point predictors for the future response Yn+1.
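To fix ideas, here is a minimal sketch in which the transformation is taken to be the simple residual map ϵi = Yi − μ(Xi), with μ(·) replaced by a kernel estimate such as the one sketched earlier; this particular choice of Hm is just one convenient possibility under an additive structure (Politis (2015) constructs more flexible transformations), and `mu_hat` below stands for any such estimate of μ(·):

```python
import numpy as np

def point_predict(X, Y, x_f, mu_hat):
    """Point prediction of Y_{n+1} at regressor value x_f, taking the residual map
    eps_i = Y_i - mu_hat(X_i) as a stand-in for the transformation H_n."""
    eps = Y - np.array([mu_hat(x) for x in X])   # forward step: eps_n is known given the data
    # Predicting the future i.i.d. quantity eps_{n+1} by its sample mean is trivial;
    # mapping back through the inverse transformation yields the predictor of Y_{n+1}.
    return mu_hat(x_f) + np.mean(eps)

# e.g., point_predict(X, Y, 5.0, lambda x: nw_estimate(x, X, Y))
```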

Prediction intervals can then be immediately constructed by resampling the i.i.d. variables ϵ1, … , ϵn; this can be thought of as an extension of the model-based, residual bootstrap of Efron (1979) to Model-Free settings since, if model (1) were to hold true, the residuals from the model could be considered the outcomes of the requisite transformation Hn.
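Continuing the residual-map sketch above, a rough version of such a resampling-based prediction interval is given below; for simplicity, the estimation variability of `mu_hat` is ignored here (the full construction in Politis (2015) accounts for such effects, e.g. via predictive residuals):

```python
import numpy as np

def prediction_interval(X, Y, x_f, mu_hat, alpha=0.05, B=1000, seed=0):
    """Resampling-based (1 - alpha) prediction interval for Y_{n+1} at x_f,
    obtained by resampling the transformed quantities eps_1, ..., eps_n."""
    rng = np.random.default_rng(seed)
    eps = Y - np.array([mu_hat(x) for x in X])   # approximately i.i.d. after the transformation
    eps = eps - eps.mean()                       # center, so the resampled errors have mean zero
    # Draw B future values of eps_{n+1} and map them back through the inverse transformation:
    future = mu_hat(x_f) + rng.choice(eps, size=B, replace=True)
    return np.quantile(future, [alpha / 2, 1 - alpha / 2])
```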

4. Time series

Under regularity conditions, a transformation such as Hm of the Model-Free Prediction Principle always exists but is not necessarily unique. For example, if the variables (Y1, … , Ym) have an absolutely continuous joint distribution and no explanatory variables Xm are available, then the Rosenblatt (1952) transformation can map them onto a set of i.i.d. random variables. Nevertheless, estimating the Rosenblatt transformation from data may be infeasible except in special cases. On the other hand, a practitioner may exploit a given structure for the data at hand, e.g., a regression structure, in order to construct a different, case-specific transformation that may be practically estimable from the data.

Recall that the Rosenblatt transformation maps an arbitrary random vector Ym = (Y1, … , Ym) having absolutely continuous joint distribution onto a random vector Um = (U1, … , Um) whose entries are i.i.d. Uniform(0,1); this is done via the probability integral transform based on conditional distributions. For k > 1, define the conditional distributions Fk(yk | yk−1, … , y1) = P{Yk ≤ yk | Yk−1 = yk−1, … , Y1 = y1}, and let F1(y1) = P{Y1 ≤ y1}. Then the Rosenblatt transformation amounts to letting U1 = F1(Y1), U2 = F2(Y2 | Y1), U3 = F3(Y3 | Y2, Y1), …, and Um = Fm(Ym | Ym−1, … , Y2, Y1).
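As a toy illustration, the sketch below applies the Rosenblatt transformation to a simulated Gaussian AR(1) series; this case is chosen only because its conditional distributions are known in closed form (and, by Markovness, depend only on the previous value, anticipating the discussion below):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate a stationary Gaussian AR(1) series: Y_k = phi * Y_{k-1} + Z_k, Z_k ~ N(0,1) i.i.d.
phi, n = 0.6, 500
Y = np.empty(n)
Y[0] = rng.normal(0.0, 1.0 / np.sqrt(1.0 - phi ** 2))   # draw from the stationary distribution
for k in range(1, n):
    Y[k] = phi * Y[k - 1] + rng.normal()

# Rosenblatt transformation using the (here known) conditional distributions:
U = np.empty(n)
U[0] = norm.cdf(Y[0], scale=1.0 / np.sqrt(1.0 - phi ** 2))   # U_1 = F_1(Y_1)
U[1:] = norm.cdf(Y[1:] - phi * Y[:-1])                       # U_k = F_k(Y_k | past) = F_2(Y_k | Y_{k-1})
# The entries of U are i.i.d. Uniform(0,1).
```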

The problem is that the distributions Fk for k ≥ 1 are typically unknown and must be estimated (in a continuous fashion) from the Yn data at hand. However, unless there is some additional structure, this estimation task may be unreliable or plainly infeasible for large k. As an extreme example, note that to estimate Fn we would have only one point (in n-dimensional space) to work with. Hence, without additional assumptions, the estimate of Fn would be a point mass, which is completely unreliable and, due to its discontinuity, of little use for constructing a probability integral transform.

An example of additional structure is the Markov setup. To elaborate, suppose that the data Y1, … , Yn are a realization of a stationary (and ergodic) Markov chain. In this case, the conditional distributions Fk for all k > 1 are completely determined by the one-step transition distribution, namely F2. To see why, note that the Markov assumption implies that P{Yk ≤ yk | Yk−1 = yk−1, … , Y1 = y1} = P{Yk ≤ yk | Yk−1 = yk−1} for k > 1. Hence, the practitioner may use kernel smoothing or a related technique on the data pairs {(Yj, Yj+1) for j = 1, … , n−1} in order to estimate the common joint distribution of these pairs. In turn, this yields estimates of F1 and F2, and by extension Fk for k > 2, so that the Rosenblatt transformation can be practically implemented as part of the Model-Free Prediction Principle.
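A minimal sketch of this smoothing step is given below; the Gaussian kernel and the fixed bandwidth are illustrative choices, and in a careful implementation the indicator in the last line would be replaced by a smooth kernel CDF so that the estimate is continuous in y, in line with the continuity requirement noted earlier:

```python
import numpy as np

def transition_cdf_estimate(Y, y, x, h=0.3):
    """Kernel-smoothed estimate of F_2(y | x) = P{Y_{j+1} <= y | Y_j = x},
    computed from the data pairs (Y_j, Y_{j+1}), j = 1, ..., n-1."""
    prev, nxt = Y[:-1], Y[1:]                   # the data pairs
    w = np.exp(-0.5 * ((prev - x) / h) ** 2)    # Gaussian kernel weights in the conditioning variable
    return np.sum(w * (nxt <= y)) / np.sum(w)   # kernel-weighted empirical conditional CDF

# Estimated Rosenblatt variables for k > 1:
#   U_k = transition_cdf_estimate(Y, Y[k], Y[k-1])
```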

Further examples of transformations applicable to diverse settings with regression and/or time series data are discussed in Politis (2015).

References

[1] Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist., vol. 7, pp. 1–26.

[2] Politis, D.N. (2015). Model-Free Prediction and Regression: A Transformation-Based Approach to Inference, Springer, New York.

[3] Rosenblatt, M. (1952). Remarks on a multivariate transformation. Ann. Math. Statist., vol. 23, pp. 470–472.