Shrinkage Estimation of Large Covariance Matrices: Keep it Simple, Statistician?

Under rotation-equivariant decision theory, sample covariance matrix eigenvalues can be optimally shrunk by recombining sample eigenvectors with a (potentially nonlinear) function of the unobservable population covariance matrix. The optimal shape of this function reflects the loss/risk that is to be minimized. We solve the problem of optimal covariance matrix estimation under a variety of loss functions motivated by statistical precedent, probability theory, and differential geometry. A key ingredient of our nonlinear shrinkage methodology is a new estimator of the angle between sample and population eigenvectors, without making strong assumptions on the population eigenvalues. We also introduce a broad family of covariance matrix estimators that can handle all regular functional transformations of the population covariance matrix under large-dimensional asymptotics. In addition, we compare via Monte Carlo simulations our methodology to two simpler ones from the literature, linear shrinkage and shrinkage based on the spiked covariance model.


Introduction
Ever since Stein (1956) proved that the usual estimator of the mean is inadmissible in dimension three or higher, decision theory has taken the edge over likelihood maximization in multivariate statistics. This leaves open the question of which loss function to minimize in a practical application. In this respect, the more loss functions that are available the better, as different researchers may pursue different goals. Regarding second moments, that is, covariance matrix estimation, six loss functions have been investigated so far within the framework of large-dimensional asymptotics, by Ledoit and Wolf (2018) and Engle et al. (2019), yielding a grand total of three different optimal nonlinear shrinkage formulas.
This paper delivers the technology to double the number of loss functions that can be handled from 6 to 12, without making strict assumptions. The six new loss functions considered are potentially attractive to applied researchers, as they have been promoted before by statisticians for decision-theoretical estimation of the covariance matrix. In order to achieve this degree of generality, we identify a formula from random matrix theory (RMT) that enables us to develop a new estimator of the angle of any sample eigenvector with any population eigenvector, in the large-dimensional limit. Using this new technique opens the door to addressing a large set of loss functions that were previously unattainable within the framework of large-dimensional asymptotics with the techniques of Ledoit and Wolf (2018): In addition to the six specific new loss functions considered, we can also handle two infinite general families of loss functions based on all regular transformations of the population covariance matrix.
Before starting to develop our methodology, it will be useful to give a brief review of the relevant literature. Likelihood maximization has done wonders for statistics in general; however, in the particular context of multivariate statistics when the number of parameters to be estimated is large, it tends to overfit in-sample data, at the expense of good out-of-sample performance. In reaction to that, decision theory favors estimators that perform well out-of-sample with respect to some given loss function. These estimators critically depend on the loss function selected by the end-user.
For covariance matrix estimation, we place ourselves firmly within the paradigm pioneered by Stein (1975, 1986): (i) no assumption on the eigenvalues of the population covariance matrix apart from positive definiteness; (ii) equivariance with respect to rotation of the original orthonormal basis of variables; and (iii) full flexibility to modify the eigenvalues of the sample covariance matrix as deemed necessary. This is a tall order, and even Stein's finite-sample mathematical prowess achieved limited progress. It was only after cross-pollination from RMT, a field originated by Nobel Prize-winning physicist Eugene Wigner (1955), and specifically the notion of large-dimensional asymptotics, that conclusive strides forward could be made. Charles Stein himself was well aware, as early as 1969, of the potential of large-dimensional asymptotics to unlock the multivariate problems that preoccupied him (Stein, 1969, pp. 79-81). However, he left some work on the table for his intellectual successors in this respect.

Consider a population covariance matrix Σ with eigenvalues (τ_1, …, τ_p), sorted in nondecreasing order without loss of generality (w.l.o.g.), and corresponding eigenvectors (v_1, …, v_p); the observations on the p variables are collected in an n × p data matrix Y.
(The case p > n is treated in Section 5.) The sample covariance matrix is S := Y′Y/n. Its spectral decomposition is S = UΛU′, where Λ is a diagonal matrix and U is orthogonal. Let Λ := Diag(λ), where λ := (λ_1, …, λ_p)′, with the eigenvalues again sorted in nondecreasing order w.l.o.g. The ith sample eigenvector is u_i, the ith column vector of U, so that S = Σ_{i=1}^p λ_i · u_i u_i′. Similarly, Σ = Σ_{i=1}^p τ_i · v_i v_i′. This class assumes no a priori information about the orientation of the orthonormal basis of (unobservable) population covariance matrix eigenvectors; this is different from the sparsity literature, which requires a priori knowledge of an orthonormal basis in which most covariances are zero. For many loss functions, there exists a finite-sample optimal (FSOPT) 'estimator' in this class of the form

S* := Σ_{i=1}^p γ^{-1}(u_i′ γ(Σ) u_i) · u_i u_i′ ,   (2.1)

where γ denotes some smooth invertible function mapping (0, +∞) into ℝ. Here, as is standard, applying a univariate function γ to a diagonalizable positive-definite matrix means preserving its eigenvectors and applying γ to each eigenvalue individually; for example, γ(Σ) = Σ_{j=1}^p γ(τ_j) · v_j v_j′.

Remark 2.1. S* in Equation (2.1) is obviously not feasible in practice, hence the single quotation marks around the word 'estimator'.
Remark 2.2. To simplify the notation, and in line with the related literature, we assume throughout the paper that all variables have mean zero. Appendix D deals with the case when this assumption is not (known to be) true.
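To make the FSOPT construction concrete, here is a minimal numerical sketch (our own illustration, not from the paper; the dimensions, eigenvalue grid, and choices of γ are illustrative assumptions) of the oracle shrunk eigenvalues γ^{-1}(u_i′γ(Σ)u_i):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 50, 100

# Hypothetical population covariance: eigenvalues spread over [1, 5],
# random orthonormal eigenvectors.
tau = np.linspace(1.0, 5.0, p)
V = np.linalg.qr(rng.standard_normal((p, p)))[0]
Sigma = V @ np.diag(tau) @ V.T

# Sample covariance matrix S = Y'Y/n with Y = X * Sigma^{1/2}.
Y = rng.standard_normal((n, p)) @ V @ np.diag(np.sqrt(tau)) @ V.T
S = Y.T @ Y / n
lam, U = np.linalg.eigh(S)   # sample eigenvalues/vectors, nondecreasing order

def fsopt(gamma, gamma_inv):
    """Oracle FSOPT shrunk eigenvalues: gamma^{-1}(u_i' gamma(Sigma) u_i)."""
    gamma_Sigma = V @ np.diag(gamma(tau)) @ V.T  # gamma acts on the eigenvalues
    return gamma_inv(np.einsum('ji,jk,ki->i', U, gamma_Sigma, U))

phi_frob  = fsopt(lambda x: x, lambda x: x)       # Frobenius: u_i' Sigma u_i
phi_stein = fsopt(lambda x: 1/x, lambda x: 1/x)   # Stein-type: 1/(u_i' Sigma^{-1} u_i)
```

Because the construction uses the unobservable Σ, this is an oracle computation, exactly in the sense of Remark 2.1; the remainder of the paper is about estimating these quantities from data alone.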

A Brief Summary of Known Results on Nonlinear Shrinkage
So far, only six loss functions have been solved in the very general rotation-equivariant framework of Assumption 1 and Definition 2.1. In the second column of Table 2.1, the loss functions are streamlined for readability; the actual ones may be squared or have various constants added or multiplied in ways that are irrelevant to estimator optimality. The way to read the fourth column is that the ith sample eigenvalue λ_i = u_i′Su_i (i = 1, …, p) should be replaced, optimally in finite samples with respect to the same-row loss function, by the quantity in the fourth column: it is the optimally 'shrunk' eigenvalue. We use the standard notation ‖·‖_F for the Frobenius norm; for example, the row for the Frobenius loss ‖S − Σ‖_F (Leung and Muirhead, 1987) has optimally shrunk eigenvalue u_i′Σu_i.

Additional Loss Functions
The easiest way to start this investigation is to look for different loss functions that give rise to the same nonlinear shrinkage formulas as the ones in Section 2.2.

Table 2.2: Two more loss functions leading to existing nonlinear shrinkage formulas.
The second loss function is new. It is derived from the Sharma and Krishnamoorthy (1985) loss in the same way that the Inverse Frobenius loss is derived from the Frobenius loss, or that the Inverse Stein's loss of Ghosh and Sinha (1987) is derived from the original Stein's loss: by substituting the covariance matrix with its inverse, the precision matrix. At the same time, it has a more interesting justification as minus the quadratic utility function of Markowitz (1952) in large dimensions, as argued in Appendix A (hence the name disutility). It is a close cousin of the Minimum Variance loss function, with a tighter grip on the scale of the estimator. Reassuringly, both of them give rise to the same optimal nonlinear shrinkage formula.
There are three interlocking reasons for bringing up these loss functions, even though they fall back on the known estimators of Section 2.2. First, to avoid the well-known 'file-drawer problem' (also called publication bias), whereby results that are deemed less interesting remain unpublished. Second, some applied researcher may well look at one of these loss functions and recognize that it suits his or her objective perfectly, in which case it does not matter whether the shrinkage formula is old or new. Third, in the end, the choice of estimator is a choice of shrinkage formula, and the best way to know what a specific shrinkage really means is to list as many loss functions as possible that lead to it.

New Shrinkage Formulas
The main point of the paper is to go beyond the two cases γ(x) = x ±1 and thereby to study other functions of the population covariance matrix (through the prism of sample eigenvectors).
We introduce four more: Log-Euclidian, Fréchet (Dowson and Landau, 1982), Quadratic, and Inverse Quadratic.

Table 2.3: New set of finite-sample optimal (FSOPT) nonlinear shrinkage formulas.
Log-Euclidian It is defined as the Euclidian distance applied after mapping the manifold of symmetric positive-definite matrices through the matrix logarithm, hence the name. It is a close cousin of the geodesic distance on the smooth Riemannian manifold of positive-definite matrices. It has essentially the same properties, but is much more tractable for statistical applications. In particular, it is invariant with respect to matrix inversion, so eigenvalues close to zero are treated like eigenvalues close to infinity.
Fréchet The Fréchet discrepancy, named after the French mathematician Maurice Fréchet, is originally a measure of distance between two probability distributions. In the multivariate normal case, it induces a notion of distance between any two symmetric positive-definite matrices. Intuitively, we should think of it as a measure of 'how far apart' the two distributions generated by these covariance matrices are.
Quadratic This is a recent variant of the quadratic-type loss function that can be traced back to pioneers in the field such as Selliah (1964, Section 2.2.4) and Haff (1979b, loss function L_2). Its signature is that it promotes accuracy in the direction of the smallest principal components of the population covariance matrix.
Inverse Quadratic Same as above, but with the inverse sample covariance matrix. Mechanically, it promotes accuracy in the direction of the largest principal components of the population covariance matrix.
The logarithm and the square root are directly embedded into the first two shrinkage formulas (Log-Euclidian and Fréchet), but the square and inverse-square functions only appear in the last two shrinkage formulas as parts of combinations, echoing what happened with the Symmetrized Stein's loss. Proof that the loss functions in the second column of the tables give rise to the FSOPT formulas in the fourth column can be found in Appendix B.
These seven nonlinear shrinkage formulas give rather different results. Researchers may wonder how they compare to each other. One interesting mathematical observation is that they do not cross: one is always above (or below) the other across the whole spectrum, with the sole exception of Symmetrized Stein vs. Log-Euclidian shrinkage. The following proposition reveals the ordering.
Proposition 2.1. Under Assumption 1, with probability one, for all i = 1, …, p,

u_i′Σ^{-1}u_i / u_i′Σ^{-2}u_i ≤ (u_i′Σ^{-1}u_i)^{-1} ≤ exp(u_i′ log(Σ) u_i) ≤ (u_i′Σ^{1/2}u_i)² ≤ u_i′Σu_i ≤ u_i′Σ²u_i / u_i′Σu_i ,

and the Symmetrized Stein formula √(u_i′Σu_i / u_i′Σ^{-1}u_i) also lies between (u_i′Σ^{-1}u_i)^{-1} and u_i′Σu_i.

Proof. Follows from Jensen's inequality and the Cauchy-Schwarz inequality once we remark that √x and log(x) are concave in x, and that Σ_{j=1}^p (u_i′v_j)² = 1, for every i = 1, …, p.
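The chain of inequalities can be spot-checked numerically. In this sketch (our own illustration, not from the paper), a random probability vector w stands in for the weights (u_i′v_j)², j = 1, …, p, which are nonnegative and sum to one:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 40
w = rng.dirichlet(np.ones(p))      # stands in for the weights (u_i'v_j)^2
tau = rng.uniform(0.5, 10.0, p)    # population eigenvalues

quad      = (w @ tau**-1) / (w @ tau**-2)   # Quadratic formula
harm      = 1.0 / (w @ tau**-1)             # Stein-type (harmonic mean)
log_eucl  = np.exp(w @ np.log(tau))         # Log-Euclidian (geometric mean)
frechet   = (w @ np.sqrt(tau))**2           # Fréchet
arith     = w @ tau                         # Frobenius (arithmetic mean)
inv_quad  = (w @ tau**2) / (w @ tau)        # Inverse Quadratic
sym_stein = np.sqrt(arith * harm)           # Symmetrized Stein
```

The seven quantities are simply generalized means of the population eigenvalues under the weights w, which is why the power-mean inequality orders them.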

Preview of General Result
FSOPT 'estimators' of the form (2.1) cannot be used directly because they depend on the population covariance matrix Σ, which is unobservable. So it stands to reason to ask: how is it even possible that this approach leads anywhere? First of all, note that we do not need to estimate all p(p + 1)/2 entries of the symmetric matrix Σ; we only need the p quantities u_i′γ(Σ)u_i, for i = 1, …, p, which is much more manageable. When the matrix dimension p is large, it is possible to approximate these quantities by the general formula

u_i′γ(Σ)u_i ≈ (1/p) Σ_{j=1}^p γ(τ̂_j) · { (p/n) λ_i τ̂_j / ( [ τ̂_j (1 − p/n − (p/n) λ_i Re m̆_{n,p}^{τ̂}(λ_i)) − λ_i ]² + [ (p/n) λ_i τ̂_j Im m̆_{n,p}^{τ̂}(λ_i) ]² ) } ,   (2.4)

where τ̂ := (τ̂_1, …, τ̂_p)′ is an estimator of the population eigenvalues, and m̆_{n,p}^{τ̂}(·) is the complex-valued function of real argument due to Ledoit and Wolf (2015, Section 2). Formula (2.4) generates bona fide covariance matrix estimators of the type (2.1) for all the loss functions in Table 2.3 by setting γ(x) equal to log(x), √x, x^{-2}, or x². Given that u_i′γ(Σ)u_i = (1/p) Σ_{j=1}^p γ(τ_j) · p(u_i′v_j)², the term between curly brackets in (2.4) is simply an estimator of the dimension-normalized squared dot product of the ith sample eigenvector with the jth population eigenvector.

Analysis Under Large-Dimensional Asymptotics
We now move on to formally establishing that plugging the approximation (2.4) into the generic nonlinear shrinkage formula (2.1) yields optimal rotation-equivariant covariance matrix estimators under large-dimensional asymptotics with respect to the loss functions listed. First of all, to make the paper self-contained, we need to restate some sets of assumptions that have been used a number of times before. We shall do so in a condensed fashion; any unfamiliar reader interested in getting more background information should refer to some earlier paper such as, for example, Ledoit and Wolf (2018, Section 3.1) and the references therein.
In a nutshell: The dimension p goes to infinity along with the sample size n, their ratio p/n converges to some limit c ∈ (0, 1), and we seek to asymptotically optimize the way to nonlinearly shrink sample eigenvalues. Also, from now on, all dimension-dependent objects are subscripted by the sample size n.

Large-Dimensional Asymptotic Framework
Assumption 2 (Dimension). Let n denote the sample size and p . . = p(n) the number of variables.
It is assumed that the ratio p/n converges, as n → ∞, to a limit c ∈ (0, 1) called the limiting concentration (ratio). Furthermore, there exists a compact interval included in (0, 1) that contains p/n for all n large enough.

Assumption 3 (Population Covariance Matrix).
a. The p × p population covariance matrix Σ n is nonrandom symmetric positive-definite.
b. Let τ_n := (τ_{n,1}, …, τ_{n,p})′ denote a system of eigenvalues of Σ_n, and H_n their empirical distribution function (e.d.f.): H_n(x) := (1/p) Σ_{i=1}^p 1_{[τ_{n,i},+∞)}(x), where 1 denotes the indicator function of a set. It is assumed that H_n converges weakly to some limit law H, called the limiting spectral distribution (function).
c. Supp(H), the support of H, is the union of a finite number of closed intervals in (0, +∞).
Note that this assumption includes Johnstone's (2001) spiked covariance model as a special case where the limiting population spectral distribution H is a step function with a single step.
Assumption 4 (Data Generating Process). X_n is an n × p matrix of i.i.d. random variables with mean zero, variance one, and finite 12th moment. The matrix of observations is Y_n := X_n √Σ_n.
Neither √ Σ n nor X n are observed on their own: only Y n is observed.
Note that this assumption includes Johnstone's (2001) spiked covariance model as a special case where the variates are assumed to be normal.
Remark 3.1 (Moment Condition). The existence of a finite 12th moment is assumed to prove certain mathematical results using the methodology of Ledoit and Péché (2011). However, Monte Carlo studies in Ledoit and Wolf (2012, 2015) indicate that this assumption is not needed in practice and can be replaced with the existence of a finite fourth moment. This is a generic requirement that does not depend on any particular loss function.
The sample covariance matrix is defined as S_n := n^{-1} Y_n′Y_n = n^{-1} √Σ_n X_n′X_n √Σ_n. It admits a spectral decomposition S_n = U_n Λ_n U_n′, where Λ_n is a diagonal matrix and U_n is an orthogonal matrix: U_n U_n′ = U_n′U_n = I_n, where I_n (in slight abuse of notation) denotes the identity matrix of dimension p × p. Let Λ_n := Diag(λ_n), where λ_n := (λ_{n,1}, …, λ_{n,p})′. We can assume w.l.o.g. that the sample eigenvalues are sorted in nondecreasing order: λ_{n,1} ≤ λ_{n,2} ≤ ⋯ ≤ λ_{n,p}. Correspondingly, the ith sample eigenvector is u_{n,i}, the ith column vector of U_n.

Under Assumptions 2-4, the e.d.f. of sample eigenvalues F_n(x) := (1/p) Σ_{i=1}^p 1_{[λ_{n,i},+∞)}(x) converges almost surely to a nonrandom cumulative distribution function F that depends only on H and c. How to go from (H, c) to F is determined by the following equation, due to Silverstein (1995): for all z in C⁺, the half-plane of complex numbers with strictly positive imaginary part,

m_F(z) = ∫ dH(τ) / { τ [1 − c − c z m_F(z)] − z } ,   (3.1)

where m_F denotes the Stieltjes (1894) transform of F, whose standard definition is

∀z ∈ C⁺   m_F(z) := ∫ (λ − z)^{-1} dF(λ) .

The Stieltjes transform admits a well-known inversion formula:

G(b) − G(a) = lim_{η→0⁺} (1/π) ∫_a^b Im[m_G(x + iη)] dx ,

if G is continuous at both a and b. Although the Stieltjes transform of F, m_F, is a function whose domain is the upper half of the complex plane, it admits an extension to the real line, since Silverstein and Choi (1995) show that ∀x ∈ (0, +∞), lim_{z∈C⁺→x} m_F(z) =: m̆_F(x) exists and is continuous. The imaginary part of m̆_F is the derivative of F, up to rescaling by π; therefore, (3.1) enables us to pin down the location of the sample eigenvalues, a fact exploited by the QuEST function; see Section 3.2. Furthermore, the support of the limiting distribution of the sample eigenvalues, Supp(F), is the union of a finite number κ ≥ 1 of compact intervals: Supp(F) = ∪_{k=1}^κ [a_k, b_k], with 0 < a_1 < b_1 < ⋯ < a_κ < b_κ < +∞.

Definition 3.1 (Rotation-Equivariant Estimators). We consider covariance matrix estimators of the type Ŝ_n := U_n D̂_n U_n′, where D̂_n is a diagonal matrix: D̂_n := Diag(ϕ̂_n(λ_{n,1}), …, ϕ̂_n(λ_{n,p})), and ϕ̂_n is a (possibly random) real univariate function which can depend on S_n.
Assumption 5 (Nonlinear Shrinkage Function). We assume that there exists a nonrandom real univariate function ϕ defined on Supp(F) and continuously differentiable on ∪_{k=1}^κ [a_k, b_k] such that ϕ̂_n(x) → ϕ(x) almost surely, for all x ∈ Supp(F). Furthermore, this convergence is uniform over x ∈ ∪_{k=1}^κ [a_k + η, b_k − η], for any small η > 0. Finally, for any small η > 0, there exists a finite nonrandom constant K such that, almost surely, |ϕ̂_n(x)| is uniformly bounded by K over x ∈ ∪_{k=1}^κ [a_k − η, b_k + η], for all n large enough.
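For intuition, Silverstein's equation (3.1) can be solved numerically by discretizing H over a grid of population eigenvalues. The following sketch is our own (the damping factor, starting point, and evaluation point are ad hoc choices, not from the paper):

```python
import numpy as np

def m_F(z, tau, c, tol=1e-12, max_iter=5000):
    """Solve the discretized Silverstein equation
    m = mean_j 1 / (tau_j * (1 - c - c*z*m) - z)
    for z in the upper half-plane, by damped fixed-point iteration."""
    m = 0j
    for _ in range(max_iter):
        m_new = np.mean(1.0 / (tau * (1.0 - c - c * z * m) - z))
        if abs(m_new - m) < tol:
            break
        m = 0.5 * m + 0.5 * m_new   # damping stabilizes the iteration
    return m_new

# Population eigenvalues evenly spread on [1, 5], concentration c = 0.5.
tau = np.linspace(1.0, 5.0, 100)
c = 0.5
z = 3.0 + 0.5j
m = m_F(z, tau, c)
# Im(m)/pi approximates the density of sample eigenvalues near Re(z),
# in line with the inversion formula.
density = m.imag / np.pi
```

The QuEST function implements a far more careful version of this computation, including the delicate limit towards the real line.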

The QuEST Function
Once again, to make the paper self-contained, we need to restate the definition of a key mathematical object called the QuEST (quantized eigenvalues sampling transform) function.
We shall do so in condensed fashion; the interested reader is referred to Ledoit and Wolf (2015, 2017) for full background information.
In a nutshell: QuEST is a multivariate deterministic function mapping population eigenvalues into sample eigenvalues, valid asymptotically as p and n go to infinity together.
Definition 3.2 (QuEST). For any given n and p, the QuEST function Q_{n,p} maps a vector of population eigenvalues t := (t_1, …, t_p)′ ∈ (0, +∞)^p into the corresponding vector of 'quantized' sample eigenvalues Q_{n,p}(t) := (q_{n,p}^1(t), …, q_{n,p}^p(t))′, obtained as the quantiles of the limiting sample spectral distribution implied by t; and, for all x in ℝ, m̆_{n,p}^t(x) is the unique solution m ∈ C⁺ to the fundamental equation

m = (1/p) Σ_{j=1}^p 1 / { t_j [1 − p/n − (p/n) x m] − x } ,   (3.2)

which is the finite-dimensional counterpart of Silverstein's equation (3.1).

Theorem 3.1 (Ledoit and Wolf (2015)). Suppose Assumptions 2-4 are satisfied. Define

τ̂_n := argmin_{t ∈ (0,+∞)^p} ‖Q_{n,p}(t) − λ_n‖² ,   (3.3)

where Q_{n,p}(t) is the QuEST function from Definition 3.2; both τ̂_n and λ_n are assumed sorted in nondecreasing order. Let τ̂_{n,j} denote the jth entry of τ̂_n (j = 1, …, p), and let τ_n := (τ_{n,1}, …, τ_{n,p})′ denote the population covariance matrix eigenvalues sorted in nondecreasing order; then (1/p) Σ_{j=1}^p (τ̂_{n,j} − τ_{n,j})² → 0 almost surely. The function m̆_{n,p}^{τ̂_n} featured in the approximation (2.4) is a by-product of the QuEST function constructed by combining Equations (3.2)-(3.3). It estimates the complex-valued deterministic function of real argument m̆_F.

Dot Product of Population Eigenvectors with Sample Eigenvectors
Of much importance in this paper is the random bivariate cumulative distribution function

Θ_n(x, t) := (1/p) Σ_{i=1}^p Σ_{j=1}^p (u_{n,i}′v_{n,j})² 1_{[λ_{n,i},+∞)}(x) 1_{[τ_{n,j},+∞)}(t) ,

first introduced in Equation (6) of Ledoit and Péché (2011) under the notation Φ_N. From Θ_n we can extract precise information about the relationship between sample and population eigenvectors. In theory, the dot product u_{n,i}′v_{n,j} would be something worth looking at; however, its sign is irrelevant, so we focus on the square (u_{n,i}′v_{n,j})² instead. Even then, we have to bear in mind that we operate under large-dimensional asymptotics, so all quantities need to be normalized by the ever-increasing matrix dimension p in appropriate fashion. In this particular instance, (u_{n,i}′v_{n,j})² vanishes at speed 1/p, as can be seen from the identities Σ_{j=1}^p (u_{n,i}′v_{n,j})² = 1 and Σ_{i=1}^p (u_{n,i}′v_{n,j})² = 1, which hold because both bases are orthonormal. Therefore, it is more convenient to study p(u_{n,i}′v_{n,j})² instead. The average of the quantities of interest p(u_{n,i}′v_{n,j})² over the sample (respectively population) eigenvectors associated with the sample (respectively population) eigenvalues lying in a given interval can be read off Θ_n. Thus, the object of interest is the Radon-Nikodym derivative of (the limit of) Θ_n(x, t) with respect to the cross-product F(x) H(t), which is exactly what Equation (3.6) delivers.
Theorem 3.2 (Ledoit and Péché (2011)). Under Assumptions 2-4, ∀λ, τ ∈ ℝ, Θ_n(λ, τ) converges almost surely to some nonrandom bivariate cumulative distribution function whose Radon-Nikodym derivative with respect to F(λ) H(τ) is

θ(λ, τ) = c λ τ / ( { τ [1 − c − c λ Re m̆_F(λ)] − λ }² + { c λ τ Im m̆_F(λ) }² ) .   (3.6)

The Radon-Nikodym derivative θ(λ_{n,i}, τ_{n,j}) is 'essentially like' the squared dot product p(u_{n,i}′v_{n,j})² for large p and n. In order to operationalize Equation (3.6), we need bona fide estimators for its ingredients, and they are provided by Section 3.2's QuEST function:

θ̂_n(λ_{n,i}, τ̂_{n,j}) := (p/n) λ_{n,i} τ̂_{n,j} / ( { τ̂_{n,j} [1 − p/n − (p/n) λ_{n,i} Re m̆_{n,p}^{τ̂_n}(λ_{n,i})] − λ_{n,i} }² + { (p/n) λ_{n,i} τ̂_{n,j} Im m̆_{n,p}^{τ̂_n}(λ_{n,i}) }² ) .   (3.7)

Although the expression may seem a bit unusual, it is just what comes out of RMT, and we should count ourselves lucky to have any closed-form solution at all. This 'luck' is first and foremost due to the pioneering efforts of probabilists who came before. If Equations (3.1), (3.2), (3.6), and (3.7) appear to be descendants from each other, it is because they are. A graphical illustration in the case where the population eigenvalues are evenly spread on the interval [1, 5], with concentration ratio p/n = 0.5, is given in the accompanying figure, in which, on the horizontal axes, eigenvectors are indexed by their respective eigenvalues.
One can see that the spread of sample eigenvalues is much wider: from 0.2 to 10.2. Top-ranked sample eigenvectors are more aligned with top-ranked population eigenvectors, and bottom-ranked sample eigenvectors are more aligned with bottom-ranked population eigenvectors. The overall pattern is complicated and can only be captured by the function θ of Theorem 3.2.
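The alignment pattern can be reproduced by simulation. This sketch is our own illustration (parameters chosen to mirror the example above; the population eigenvectors are taken to be the canonical basis, w.l.o.g.) and estimates E[p(u_{n,i}′v_{n,j})²] by averaging over independent draws:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 100, 200
tau = np.linspace(1.0, 5.0, p)   # population eigenvalues evenly spread on [1, 5]
sigma_sqrt = np.sqrt(tau)        # population eigenvectors = canonical basis

trials = 50
align = np.zeros((p, p))         # align[i, j] will estimate E[p * (u_i' v_j)^2]
for _ in range(trials):
    X = rng.standard_normal((n, p))
    Y = X * sigma_sqrt           # Y = X * Sigma^{1/2} for diagonal Sigma
    S = Y.T @ Y / n
    _, U = np.linalg.eigh(S)     # eigenvalues sorted in nondecreasing order
    align += p * (U**2).T        # (u_i' v_j)^2 = U[j, i]^2
align /= trials
```

By orthonormality, each row of align averages exactly to one; the informative content lies in the deviations from that baseline, which trace out the function θ.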

Asymptotically Optimal Nonlinear Shrinkage Estimators
The two loss functions from Table 2.2 are easy to handle using the techniques of Ledoit and Wolf (2018): The nonlinear shrinkage estimator that they call S * n is optimal with respect to the Weighted Frobenius loss under large-dimensional asymptotics; and the estimator that they call S • n is optimal with respect to the Disutility loss. These results are stated without proof, as they are just minor extensions of the arguments put forward by Ledoit and Wolf (2018).
Regarding the loss functions of Table 2.3, they are vastly more challenging, and cannot be handled with existing techniques. Instead, they can only be handled by using the new technique of angle estimation introduced in Section 3.3 above, as we shall now proceed to demonstrate.

Four Specific Loss Functions
All remaining theorems are proven in Appendix C. We start with asymptotically optimal bona fide estimators based on Table 2.3.
Theorem 4.1 (Log-Euclidian). For any estimator Ŝ_n in Definition 3.1, the Log-Euclidian loss converges under Assumptions 2-5 almost surely to a deterministic limit that depends only on H, c, and ϕ. This limit is minimized if ϕ̂_n(λ_{n,i}) is equal to

ϕ̂_n^{LE}(λ_{n,i}) := exp( (1/p) Σ_{j=1}^p log(τ̂_{n,j}) · θ̂_n(λ_{n,i}, τ̂_{n,j}) ) ,   (4.1)

where τ̂_n = (τ̂_{n,j})_{j=1,…,p} denotes the estimator of population covariance matrix eigenvalues in Theorem 3.1, and θ̂_n(λ_{n,i}, τ̂_{n,j}) is the estimator of the (dimension-normalized) squared dot product of the ith sample eigenvector with the jth population eigenvector in Equation (3.7). The resulting covariance matrix estimator is Ŝ_n^{LE}.

Theorem 4.2 (Fréchet). Under the same assumptions, the Fréchet loss converges almost surely to a deterministic limit that is minimized if ϕ̂_n(λ_{n,i}) is equal to

ϕ̂_n^{FRÉ}(λ_{n,i}) := ( (1/p) Σ_{j=1}^p √τ̂_{n,j} · θ̂_n(λ_{n,i}, τ̂_{n,j}) )² .   (4.2)

The resulting covariance matrix estimator is Ŝ_n^{FRÉ}.

Theorem 4.3 (Quadratic). The Quadratic loss L^Q(Ŝ_n, Σ_n) := ‖Σ_n^{-1} Ŝ_n − I_n‖_F² / p converges almost surely to a deterministic limit that is minimized if ϕ̂_n(λ_{n,i}) is equal to

ϕ̂_n^{Q}(λ_{n,i}) := [ Σ_{j=1}^p τ̂_{n,j}^{-1} · θ̂_n(λ_{n,i}, τ̂_{n,j}) ] / [ Σ_{j=1}^p τ̂_{n,j}^{-2} · θ̂_n(λ_{n,i}, τ̂_{n,j}) ] .   (4.4)

The resulting covariance matrix estimator is Ŝ_n^{Q}.

Theorem 4.4 (Inverse Quadratic). The Inverse Quadratic loss function, which is defined as L^{QINV}(Ŝ_n, Σ_n) := ‖Σ_n Ŝ_n^{-1} − I_n‖_F² / p, converges almost surely to a deterministic limit that is minimized if ϕ̂_n(λ_{n,i}) is equal to

ϕ̂_n^{QINV}(λ_{n,i}) := [ Σ_{j=1}^p τ̂_{n,j}² · θ̂_n(λ_{n,i}, τ̂_{n,j}) ] / [ Σ_{j=1}^p τ̂_{n,j} · θ̂_n(λ_{n,i}, τ̂_{n,j}) ] .   (4.5)

The resulting covariance matrix estimator is Ŝ_n^{QINV}.
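All four theorems share the same computational template: transform the estimated population eigenvalues by γ, average them against the estimated squared dot products, and map back through γ^{-1} (or take a ratio, for the Quadratic cases). A minimal sketch of this template with made-up inputs — tau_hat and theta_hat here are toy values for illustration, not QuEST output:

```python
import numpy as np

def shrink(gamma, gamma_inv, tau_hat, theta_hat):
    """phi_i = gamma^{-1}( (1/p) * sum_j gamma(tau_hat_j) * theta_hat[i, j] ),
    the common template behind the nonlinear shrinkage formulas."""
    p = len(tau_hat)
    return gamma_inv(theta_hat @ gamma(tau_hat) / p)

# Toy inputs (NOT QuEST output): each row of theta_hat averages to one,
# as the dimension-normalized squared dot products must.
tau_hat = np.array([1.0, 2.0, 4.0])
theta_hat = np.array([[2.0, 0.5, 0.5],
                      [0.5, 2.0, 0.5],
                      [0.5, 0.5, 2.0]])

phi_le   = shrink(np.log, np.exp, tau_hat, theta_hat)            # Log-Euclidian
phi_fre  = shrink(np.sqrt, np.square, tau_hat, theta_hat)        # Fréchet
phi_frob = shrink(lambda x: x, lambda x: x, tau_hat, theta_hat)  # Frobenius benchmark
```

Consistently with Proposition 2.1, the three resulting vectors of shrunk eigenvalues are ordered: Log-Euclidian below Fréchet below Frobenius.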

Two Infinite Families of Loss Functions
We have so far covered 12 loss functions, including many of the classic ones, from which we have derived a total of 7 different optimal nonlinear shrinkage formulas (as there are some commonalities). It is tedious to keep adding more by hand. Most applied researchers should have already been able to find 'the shoe that fits' in this rather extensive list by now.
If not, the only systematic method is to study an (uncountably) infinite number of loss functions, and to find the nonlinear shrinkage formula exactly optimized with respect to each of them. To the best of our knowledge, an ambitious project on this scale has never been envisioned before. In doing so, we will meet again some old acquaintances: 6 of the 12 loss functions already analyzed manually are special cases of the two general theorems presented below. The first infinite family of loss functions is what we call Generalized Frobenius.
Theorem 4.5 (Generalized Frobenius). For any invertible and continuously differentiable function γ defined on (0, +∞), the Generalized Frobenius loss L_n^{γ,F}(Ŝ_n, Σ_n) := ‖γ(Ŝ_n) − γ(Σ_n)‖_F² / p converges almost surely to a deterministic limit that is minimized if ϕ̂_n(λ_{n,i}) is equal to

ϕ̂_n^{γ}(λ_{n,i}) := γ^{-1}( (1/p) Σ_{j=1}^p γ(τ̂_{n,j}) · θ̂_n(λ_{n,i}, τ̂_{n,j}) ) .   (4.6)

The resulting covariance matrix estimator is Ŝ_n^{γ} := Σ_{i=1}^p ϕ̂_n^{γ}(λ_{n,i}) · u_{n,i}u_{n,i}′.
The Frobenius, Inverse Frobenius, Log-Euclidian, and Fréchet losses are special cases of the Generalized Frobenius family, corresponding, respectively, to γ(x) equal to x, 1/x, log(x), and √x.
A second infinite family of loss functions is based on the Kullback and Leibler (1951) divergence. Given two multivariate normal distributions N(0, A_i) with mean zero and covariance matrix A_i, for i ∈ {1, 2}, their dimension-normalized Kullback-Leibler divergence is

KL(A_1, A_2) := (1/(2p)) [ tr(A_2^{-1}A_1) − log det(A_2^{-1}A_1) − p ] .

Stein's loss and the Inverse Stein's loss are special cases of the Generalized Kullback-Leibler family defined below, obtained by setting γ(x) equal to 1/x and x, respectively.
Both infinite families of loss functions confirm the asymptotic optimality of the same infinite family of nonlinear shrinkage estimators S γ n . The Frobenius norm is important because it is just the Euclidian distance on the space of matrices, whereas the Kullback-Leibler divergence is important in a completely different field: information theory. Two justifications coming from such different perspectives combine to give strong backing to the covariance matrix estimator S γ n , no matter which function γ the end-user is interested in.
Remark 4.1. The three other nonlinear shrinkage formulas that do not fit into the mold of Equation (4.6) are just elementary combinations of ϕ γ n (·) for two different γ functions.
Remark 4.2. It should be pointed out that, apart from the two special cases γ(x) = x ±1 , these two infinite families of loss functions can also only be handled by using the new technique of angle estimation introduced in Section 3.3 above.

Singular Case: p > n
This is a case of great practical importance. When it happens, the sample covariance matrix is singular: It has p − n eigenvalues equal to zero and is thus only positive semi-definite. There then exist some linear combinations of the original variables that falsely appear to have zero variance when one only looks in-sample. In a sense, the sample covariance matrix, with its p(p + 1)/2 degrees of freedom, 'overfits' the data set of dimension n × p.

Analysis in Finite Samples
With respect to the loss functions studied in this paper, the optimal nonlinear shrinkage formula applied to the n non-zero sample eigenvalues remains the same as in the case p < n, so there is no need to revisit it. The only item left to determine is how to shrink the p − n null sample eigenvalues.
Recall that we sort the sample eigenvalues in nondecreasing order w.l.o.g., so the null eigenvalues are the first p − n ones. To build intuition, we start, as before, with the finite-sample case. The pattern is clear: compute how all the eigenvectors in the null space of the sample covariance matrix relate to (a function of) the population covariance matrix, take the average(s), and take smooth transformations of the average(s), where a transformation could simply be the identity. There is a rotational indeterminacy in this null space of dimension p − n, but the formulas in the last column are invariant to a rotation of the basis of the null eigenvectors, so it does not matter.
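The rotational-indeterminacy point can be checked directly: averages of u_i′γ(Σ)u_i over the null space are invariant to rotating the null-space basis, because they are traces. A small oracle sketch (our own illustration, with made-up dimensions, and γ taken as the identity):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 30, 20   # singular case: p > n

tau = np.linspace(1.0, 5.0, p)
V = np.linalg.qr(rng.standard_normal((p, p)))[0]
Sigma = V @ np.diag(tau) @ V.T
Y = rng.standard_normal((n, p)) @ V @ np.diag(np.sqrt(tau)) @ V.T
S = Y.T @ Y / n
lam, U = np.linalg.eigh(S)       # first p - n eigenvalues are numerically zero

U0 = U[:, :p - n]                # an orthonormal basis of the null space
# Average of u_i' Sigma u_i over the null space:
avg = np.trace(U0.T @ Sigma @ U0) / (p - n)

# Rotational indeterminacy: the average is unchanged by any rotation R
# of the null-space basis, since trace(R'AR) = trace(A).
R = np.linalg.qr(rng.standard_normal((p - n, p - n)))[0]
avg_rotated = np.trace((U0 @ R).T @ Sigma @ (U0 @ R)) / (p - n)
```
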

Analysis Under Large-Dimensional Asymptotics
Assumption 6 (Singular). The ratio p/n converges, as n → ∞, to a finite limit c > 1.
Furthermore, there exists a compact interval included in (1, +∞) that contains p/n for all n large enough.
Given that the first p − n sample eigenvalues are devoid of informational content, it is judicious to focus on the e.d.f. of the n other ones: ∀x ∈ ℝ, F̲_n(x) := (1/n) Σ_{i=p−n+1}^p 1_{[λ_{n,i},+∞)}(x). Under Assumptions 3-6, it admits a nonrandom limit F̲. Of particular interest will be its Stieltjes transform: ∀z ∈ C⁺, m_F̲(z) := ∫ (λ − z)^{-1} dF̲(λ), which admits a continuous extension onto the real line: ∀x ∈ ℝ, m̆_F̲(x) := lim_{z∈C⁺→x} m_F̲(z).

Optimal Shrinkage of Null Sample Eigenvalues
At this stage, what we need is an equivalent of Equation (2.4) that pertains to the shrinkage of the null sample eigenvalues. It comes from Theorem 9 of Ledoit and Péché (2011), which yields the limiting kernel

∀τ ∈ Supp(H)   θ(0, τ) = 1 / [ (1 − 1/c)(1 + m̆_F̲(0) τ) ] ,   (5.3)

and hence the approximation

∀i = 1, …, p − n   u_{n,i}′γ(Σ)u_{n,i} ≈ (1/p) Σ_{j=1}^p γ(τ̂_{n,j}) / [ (1 − n/p)(1 + m̆_{n,p}^{τ̂_n}(0) τ̂_{n,j}) ] ,   (5.4)

where τ̂_n := {τ̂_{n,j}}_{j=1}^p is, as before, the estimator of population eigenvalues obtained by numerically inverting the QuEST function, and m̆_{n,p}^{τ̂_n}(0) is a strongly consistent estimator of m̆_F̲(0), which is another by-product of the QuEST function (when p > n). Eigenvectors in the null space of the sample covariance matrix tend to be more (less) aligned with population eigenvectors corresponding to small (large) population eigenvalues, which makes intuitive sense. The degree of preferential alignment is inversely related to the concentration ratio, as a high ratio p/n disorients the sample eigenvectors. The overall pattern is highly nonlinear, and could only be pinned down through Equations (5.3)-(5.4) from RMT. Note that, by construction, the dimension-normalized density of the squared dot-product averages to 1, so it is deviations from the baseline number of 1 that are informative.

Covariance Matrix Estimation in the Singular Case
Theorems 4.1-4.6 remain valid when c > 1, with the understanding that the estimator of the squared dot-product in the null space of the sample covariance matrix (i = 1, …, p − n) is

∀j = 1, …, p   θ̂_n(λ_{n,i}, τ̂_{n,j}) = θ̂_n(0, τ̂_{n,j}) := 1 / [ (1 − n/p)(1 + m̆_{n,p}^{τ̂_n}(0) τ̂_{n,j}) ] .   (5.5)

In order to show how this works, we need only state and prove the singular-case counterpart of Theorem 4.5, as the other theorems are adapted from the p < n case to the p > n case in similar fashion.
Theorem 5.1. Under Assumptions 3-6, the Generalized Frobenius loss admits an almost sure (deterministic) limit, which is minimized by the nonlinear shrinkage formula (4.6), where the bivariate function θ̂_n(x, t) is given by Equation (3.7) for x > 0, and θ̂_n(0, t) is given by Equation (5.5). The resulting covariance matrix estimator is Ŝ_n^{γ} := Σ_{i=1}^p ϕ̂_n^{γ}(λ_{n,i}) · u_{n,i}u_{n,i}′.
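As a sketch of Equation (5.5), the following code pins down m̆(0) via the normalization that the null-space kernel averages to one (which, under the stated assumptions, is equivalent to its defining equation at zero); in practice this quantity is delivered by the QuEST function, and tau_hat here is a made-up input:

```python
import numpy as np

p, n = 100, 50   # singular case, p/n = 2
tau_hat = np.linspace(1.0, 5.0, p)   # hypothetical estimated population eigenvalues

# Pin down m(0) by the normalization that theta_hat(0, .) averages to one,
# i.e. mean_j 1/(1 + m*tau_j) = 1 - n/p; bisection on m > 0.
lo, hi = 1e-12, 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if np.mean(1.0 / (1.0 + mid * tau_hat)) > 1.0 - n / p:
        lo = mid   # average still too large: increase m
    else:
        hi = mid
m0 = 0.5 * (lo + hi)

# Equation (5.5): squared dot-product estimator in the null space.
theta0 = 1.0 / ((1.0 - n / p) * (1.0 + m0 * tau_hat))
```

The resulting kernel is decreasing in the population eigenvalue, reflecting the preferential alignment of null-space eigenvectors with small population eigenvalues.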

Monte Carlo Simulations
The goal of this section is to illustrate on simulated data that there is generally great benefit in using the shrinkage estimator that is tailored to the loss function one has selected.

General Setup
The population eigenvalues are distributed as follows: 20% are equal to 1, 40% are equal to 3, and 40% are equal to 10. This is a challenging problem originally introduced by Bai and Silverstein (1998). We use the 12 loss functions from Tables 2.1-2.3. For each one, we compute the FSOPT 'estimator' specific to the particular loss function, as well as all 7 bona fide shrinkage estimators presented in the paper. We use the same notation as Ledoit and Wolf (2018): Ŝ_n^• is the estimator optimal with respect to the Frobenius, Inverse Stein's, and Minimum Variance losses; Ŝ_n^* is the one optimal with respect to Stein's and Inverse Frobenius losses; and a third estimator is the one optimal with respect to the Symmetrized Stein's loss. In addition, the identity matrix (rescaled to have the same trace as the sample covariance matrix), the sample covariance matrix, and the linear shrinkage estimator of Ledoit and Wolf (2004) are also computed for reference purposes. The results are averaged over 1,000 simulations.
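A stripped-down version of this experiment (our own sketch: far fewer simulations, and only the Frobenius loss with its oracle FSOPT shrinkage, for illustration) looks as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 100, 200
# Bai and Silverstein (1998) design: 20% ones, 40% threes, 40% tens.
tau = np.concatenate([np.ones(20), 3 * np.ones(40), 10 * np.ones(40)])
Sigma = np.diag(tau)   # population eigenvectors = canonical basis, w.l.o.g.

losses_sample, losses_fsopt = [], []
for _ in range(20):    # a handful of simulations (the paper uses 1,000)
    X = rng.standard_normal((n, p))
    Y = X * np.sqrt(tau)
    S = Y.T @ Y / n
    lam, U = np.linalg.eigh(S)
    # Oracle FSOPT for the Frobenius loss: replace lambda_i by u_i' Sigma u_i.
    phi = np.einsum('ji,j,ji->i', U, tau, U)
    S_fsopt = U @ np.diag(phi) @ U.T
    losses_sample.append(np.linalg.norm(S - Sigma, 'fro')**2 / p)
    losses_fsopt.append(np.linalg.norm(S_fsopt - Sigma, 'fro')**2 / p)
```

Since the sample covariance matrix itself belongs to the rotation-equivariant class, the oracle FSOPT loss is lower by construction; the gap quantifies how much shrinkage can help in this design.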

Nonsingular Case
To produce the results of Table 6.1, the matrix dimension is p = 100 and the sample size is n = 200.
In each row, the performance of the best bona fide estimator is printed in bold. One can see that the winner is always the estimator tailor-made for the loss function of the given row.
Sometimes the difference from the other estimators is quite stark. Obviously, the FSOPT always dominates, but usually the excess loss of the best bona fide estimator is quite small. This finding reinforces the message that the asymptotically optimal estimators listed in the present paper perform as well as they ought to, even in finite samples.
Regarding the other (reference) estimators, linear shrinkage does better than the two ingredients that it interpolates, the scaled identity matrix and the sample covariance matrix, with respect to all but one of the 12 loss functions. This is good news because in theory its shrinkage intensity is optimized with respect to the Frobenius loss only. Linear shrinkage performs honorably across the board for such a simple estimator: it even manages to beat some nonlinear shrinkage estimators in almost every row, typically a couple of them. Needless to say, linear shrinkage never beats the nonlinear shrinkage formula optimized to the loss function in the given row, which shows that it 'leaves some money on the table'.

Comparison of Shrinkage Formulas
Confirming the ordering of Proposition 2.1, the Quadratic and the Inverse Quadratic shrinkage formulas stand out as 'outliers' in the corresponding figure. In Table 6.1, the estimators $S^Q_n$ and $S^{QINV}_n$ display erratic performances when measured against loss functions other than their own. The other estimators are better able to deliver respectable performance across foreign loss functions. The estimators $S^{\circ}_n$ and $S^{*}_n$ have strong backing, from the Minimum-Variance and Stein's losses, respectively; the Log-Euclidian estimator $S^{LE}_n$ represents a 'neutral' compromise that has strong foundations in the differential geometry of the manifold of tensors (a.k.a. positive definite matrices).

Singular Case
Table 6.2 presents further results when p = 200 and n = 100. Once again, the pattern is confirmed overall, except for one violation: $S^{LE}_n$ beats $S_n$ both 'home' and 'away': with respect to the Log-Euclidian loss and, unexpectedly, with respect to the Symmetrized Stein's loss also. (In other simulations not reported here, we double-checked that $S_n$ does beat $S^{LE}_n$ with respect to the Symmetrized Stein's loss when the dimension is high enough, as implied by large-dimensional asymptotic theory.) Both of these estimators plow the same narrow but interesting field of estimators that are equivariant with respect to matrix inversion, so it is not completely surprising that the estimator that beats $S_n$ on its home turf shares the same desirable property.
Remarks regarding the two simple estimators (scaled identity and linear shrinkage) essentially go in the same direction as in Section 6.2. We excluded the sample covariance matrix because it is not invertible, so most of the loss functions return +∞.

Comparison with Simpler Alternatives: The Matrix as a Whole
We examine two alternative approaches that make compromises in order to obtain formulas that are simpler than the ones developed in this paper. On the one hand, linear shrinkage (Ledoit and Wolf, 2004) compromises by forcing all eigenvalues to be shrunk towards the same target with the same shrinkage intensity, and by considering only the Frobenius loss. On the other hand, the spiked-model approach (Donoho et al., 2018) compromises by assuming that the bulk of the population eigenvalues (meaning: all of them except for a vanishing fraction of them, called spikes) are equal.
In this Monte Carlo simulation, we take both the simpler linear shrinkage and the simpler spiked model 'outside of their comfort zone' by considering 8 different loss functions, and by considering specifications where the bulk of the population eigenvalues can be different from each other. Most applied researchers will be interested to know how robust the simplified formulas are against violations of the framework under which they have been derived.
The 8 loss functions that we consider are all of the ones in the intersection of the 12 that we consider in the present paper with the 18 considered by Donoho et al. (2018): 1) Frobenius, 2) Inverse Frobenius, 3) Stein, 4) Inverse Stein, 5) Symmetrized Stein, 6) Fréchet, 7) Quadratic, and 8) Inverse Quadratic. The code for spike shrinkage was taken directly from Donoho et al. (2016). As far as the population eigenvalues are concerned, the initial specification is to have a single spike at 10, and the p − 1 bulk eigenvalues equal to 1.
From this base, we will allow for heterogeneity in the bulk by keeping half of the bulk equal to 1, while setting the other half equal to τ ∈ [1, 5]. It is only fair to allow bulk eigenvalues to not all be equal: after all, this is the generic case, and the special case where all bulk eigenvalues are equal is a measure-zero subset of the set of all possible eigenvalue combinations, so it is not necessarily representative of real-world applications. We put 100 eigenvalues in the bulk, plus (as mentioned above) a single spike, for a total of p = 101 eigenvalues. We take the (limiting) concentration ratio to be c = 1/2, which implies n = 202.
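The eigenvalue specification of this subsection can be sketched as follows (our own illustration; the helper name is hypothetical):

```python
import numpy as np

def spiked_population_eigenvalues(tau):
    """One spike at 10, plus a bulk of 100 eigenvalues: half equal
    to 1, the other half equal to tau in [1, 5]."""
    bulk = np.concatenate([np.ones(50), np.full(50, float(tau))])
    return np.concatenate([bulk, [10.0]])  # p = 101 in total

eigs = spiked_population_eigenvalues(tau=2.5)
p = eigs.size        # 101 population eigenvalues
c = 0.5              # limiting concentration ratio p/n
n = round(p / c)     # sample size 202
```

Setting tau = 1 recovers the homogeneous-bulk specification under which spike shrinkage was derived; tau > 1 takes it outside its comfort zone.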
For each loss function and each estimator, we report the percentage relative improvement in average loss (PRIAL), defined as
$$\mathrm{PRIAL}^i_n\big(\widehat S_n\big) \;:=\; \frac{\mathbb{E}\big[L^i_n(\Sigma_n, S_n)\big] - \mathbb{E}\big[L^i_n(\Sigma_n, \widehat S_n)\big]}{\mathbb{E}\big[L^i_n(\Sigma_n, S_n)\big] - \mathbb{E}\big[L^i_n(\Sigma_n, S^{*,i}_n)\big]} \times 100\%\;,$$
where $L^i_n$ denotes one of the eight loss functions listed above, $S^{*,i}_n$ denotes the FSOPT 'estimator' tailored to each specific loss function as per Tables 2.1-2.3, $\widehat S_n$ denotes the estimator under consideration (whether linear shrinkage, spike shrinkage, or nonlinear shrinkage), and the expectation is approximated by the average over 1,000 Monte Carlo simulations. By construction, the PRIAL of the sample covariance matrix is 0%, whereas the PRIAL of the FSOPT 'estimator' is 100%. The PRIAL measures how much of the potential for improvement relative to the sample covariance matrix is attained by a given estimator $\widehat S_n$.
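The PRIAL computation can be sketched as follows (a minimal illustration with hypothetical loss values; `prial` is our own helper name):

```python
import numpy as np

def prial(losses_sample, losses_estimator, losses_fsopt):
    """PRIAL of an estimator: the share of the potential improvement
    over the sample covariance matrix (relative to the FSOPT) that
    the estimator attains, with expectations replaced by Monte Carlo
    averages across simulations."""
    num = np.mean(losses_sample) - np.mean(losses_estimator)
    den = np.mean(losses_sample) - np.mean(losses_fsopt)
    return 100.0 * num / den

# By construction: sample covariance -> 0%, FSOPT -> 100%.
ls = [4.0, 5.0, 6.0]   # hypothetical losses of the sample covariance
lf = [1.0, 1.5, 2.0]   # hypothetical losses of the FSOPT
assert prial(ls, ls, lf) == 0.0
assert prial(ls, lf, lf) == 100.0
```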
One can see that, even though the dimension is not overly large (p ≈ 100), nonlinear shrinkage captures nearly 100% of the potential improvement with respect to all loss functions, regardless of how spread out the bulk population eigenvalues are. Linear shrinkage has a more mixed performance, but still manages to capture at least 50% of the potential improvement most of the time. It beats the sample covariance matrix with respect to all 8 loss functions, which shows that its attractiveness extends far beyond the Frobenius loss under which it was originally derived.
It beats spike shrinkage as long as τ ≥ 2.5 in all cases but one (the Quadratic loss, where they are essentially identical). Also worth noting is that linear shrinkage is the only estimator that keeps the same formula in all 8 subplots of Figure 6.2, so it is 'fighting with one hand tied behind its back' when it has to compete against the other two shrinkage estimators under the 7 loss functions different from the Frobenius loss.
As expected, the performance of spike shrinkage is near-perfect when its specification matches reality (τ = 1: all bulk eigenvalues are equal), but it degrades monotonically as soon as the bulk population eigenvalues become heterogeneous. This drop in performance is not so pronounced with the Inverse Frobenius, Inverse Stein, and Inverse Quadratic losses, but it is very pronounced with the 5 other loss functions. There is even a case (τ = 5 and Fréchet loss) where spike shrinkage underperforms the sample covariance matrix, which results in a negative PRIAL. This is a result that should be expected purely from theory: Unlike linear and nonlinear shrinkage, spike shrinkage can actually be worse than the sample covariance matrix even in the large-dimensional limit.
Compared to the two simpler alternatives, nonlinear shrinkage does better across the board.
In particular, there are scenarios where the optimal nonlinear shrinkage formula is (nearly) linear; and, even then, nonlinear shrinkage performs just as well as linear shrinkage, for all practical purposes. Similarly, there are scenarios where the spiked covariance model holds perfectly true (bulk maximum equal to one); and, even then, nonlinear shrinkage performs just as well as spike shrinkage, for all practical purposes.
The overall conclusion is that, among the simpler formulas, linear shrinkage can 'leave some money on the table' when the optimal shrinkage is highly nonlinear whereas spike shrinkage is vulnerable to the risk that its stringent specification of bulk-eigenvalue equality is violated by reality. Only the full-blown nonlinear shrinkage formulas derived in this paper avoid both pitfalls and deliver state-of-the-art enhancement of the sample covariance matrix across the board.

Comparison with Simpler Alternatives: Focus on the Spike
There might be a perception that nonlinear shrinkage does better on the bulk, whereas spike shrinkage does better on the spike. This is hard to justify formally, as the loss functions used in the spike literature pertain to the whole covariance matrix, placing no special over-weight on the spike. Nonetheless, in Table B.1 we isolate the spike's contribution to overall risk, for the 8 loss functions from Section 6.5. This is easy to do because every one of these loss functions can be decomposed as a sum of contributions attributable to each of the p eigenvalues. Figure 6.3 displays the Monte Carlo simulation results.
The performance of the sample covariance matrix can be so erratic as to not even be on the same scale as the other estimators in many scenarios. The risk contribution of the FSOPT is zero by construction. The pattern is that spike shrinkage and nonlinear shrinkage estimate the spike equally well when the spike model's assumption of equal bulk eigenvalues is satisfied, or nearly satisfied. However, as the bulk spreads out, nonlinear shrinkage estimates the spike more accurately than spike shrinkage for all 8 loss functions where the two methods overlap.

Conclusion
In this paper, we have
• developed a new estimator of the angle between any sample eigenvector and any population eigenvector by exploiting a sophisticated equation from random matrix theory (RMT);
• doubled the number of loss functions that can be handled from 6 to 12 (compared to related earlier work), which can only be achieved by the new technique of angle estimation;
• proposed a classification of loss functions by their finite-sample optimal shrinkage formulas;
• increased the number of asymptotically optimal nonlinear shrinkage formulas from 3 to 7 (compared to related earlier work);
• established an ordering of the nonlinear shrinkage formulas (from largest to smallest);
• and introduced a new loss function founded on the economic concept of utility maximization.
As a simpler alternative approach, Donoho et al. (2018) consider a menagerie of 26 loss functions under the spiked covariance model of Johnstone (2001). The key distinction in this model is between the bulk, which comprises eigenvalues packed shoulder to shoulder like sardines, and the spikes, which are a few select eigenvalues large enough to separate from the bulk. Donoho et al. (2018) treat the spikes carefully, but they simply collapse the bulk. This approach is perfectly legitimate under the assumption that they make, namely, that all bulk population eigenvalues are equal. However, in many applications, such an assumption is unrealistic or may not be known to hold. In the general case, bulk population eigenvalues are nonequal, so valuable information can be gleaned from the angle between sample and population eigenvectors, and from applying differentiated shrinkage inside the bulk. Monte Carlo simulations show that the resulting nonlinear shrinkage performs, for all practical purposes, just as well as spike shrinkage when bulk population eigenvalues are equal, but is often much better when they are unequal.
Therefore, the KISS (Keep it simple, statistician!) principle does not seem to benefit applied researchers: by upgrading instead from spike shrinkage to full-blown nonlinear shrinkage, they have, basically, nothing to lose but much to gain. In addition, at least currently, spike shrinkage is only available for the case where the dimension is smaller than the sample size, which limits practical applications.
Having said this, Donoho et al. (2018) roll out a clever technology that convincingly documents three closely interrelated facts that have not garnered sufficient attention in this field: 1. The choice of loss function has a profound effect on optimal estimation.
2. Eigenvalue inconsistency: The sample eigenvalues are spread, biased, and shifted away from their theoretical (population) counterparts by an asymptotically predictable amount.
3. Eigenvector inconsistency: The angles between the sample eigenvectors and the corresponding population eigenvectors have nonzero asymptotic limits.
Such fundamental truths need to be hammered in again and again, in every possible way.
Finally, we may say a word about the choice of loss function. Twelve of them have been solved already, yielding seven different nonlinear shrinkage formulas, in addition to the two infinite families, which should be more than enough to satisfy any reasonable need. By definition, it is the duty of the end-user to pick the loss function, but perhaps some light-touch guidance can help orient readers through a forest with so many trees. For anyone interested in using a covariance matrix estimator to minimize variance, risk, or noise in any sense, the Minimum Variance loss function is certainly the appropriate one; an additional advantage is that for this loss function a new technology has arisen that is no more complex than kernel density estimation, and so is extremely fast and scalable to ultra-high dimensions (Ledoit and Wolf, 2020a). For researchers concerned with the decision-theoretic aspects of the problem, a loss function based on the Kullback-Leibler divergence (also called relative entropy), such as Stein's loss, is the natural candidate. For other applications, such as fMRI tensors, where it is important to regard eigenvalues close to zero as being 'as distant' as eigenvalues close to infinity, the Log-Euclidian loss function is well suited: It appears a good compromise because it produces shrunken eigenvalues that lie in between the ones from the Minimum-Variance loss and the ones from Stein's loss.

A Portfolio Selection and the Disutility Loss
Here we explain how the Weighted Frobenius loss of Sharma and Krishnamoorthy (1985) applied to the precision matrix can be interpreted as quadratic disutility. Consider the standard mean-variance optimization problem with quadratic utility function
$$\max_{w}\; \mu' w - \frac{\rho}{2}\, w' \Sigma\, w\;,$$
where µ denotes some vector of expected returns selected by the end-user, and ρ > 0 the risk-aversion parameter (cf. Markowitz (1952)). The first-order condition is $\mu - \rho\,\Sigma\, w = 0$, and the solution is $w = \Sigma^{-1}\mu/\rho$. In practice, we only observe an estimator $\widehat S$ of the unobservable population covariance matrix Σ, so the plug-in estimator for the optimal weight vector is $\widehat w = \widehat S^{-1}\mu/\rho$. The quadratic utility associated with this vector is
$$\mu'\widehat w - \frac{\rho}{2}\,\widehat w'\,\Sigma\,\widehat w \;=\; \frac{1}{\rho}\left[\mu'\widehat S^{-1}\mu - \frac{1}{2}\,\mu'\widehat S^{-1}\Sigma\,\widehat S^{-1}\mu\right].$$
At this point, the risk-aversion coefficient ρ becomes irrelevant because, regardless of ρ, all investors want to find a covariance matrix estimator $\widehat S$ that maximizes
$$\mu'\widehat S^{-1}\mu - \frac{1}{2}\,\mu'\widehat S^{-1}\Sigma\,\widehat S^{-1}\mu\;.$$
As argued in further detail in Engle et al. (2019, Section 4.3), under large-dimensional asymptotics in conjunction with RMT, there is a key approximation: up to the common factor $\|\mu\|^2/p$, the quadratic forms $\mu'\widehat S^{-1}\mu$ and $\mu'\widehat S^{-1}\Sigma\widehat S^{-1}\mu$ behave like the traces $\mathrm{Tr}(\widehat S^{-1})$ and $\mathrm{Tr}(\widehat S^{-1}\Sigma\widehat S^{-1})$, respectively. From this we can streamline the objective function so as to make it equally adept at fitting the needs of all users, who may have different views on the choice of the vector µ: the squared Euclidian norm of the linear constraint vector µ becomes irrelevant to the estimation process, so we are left with just maximizing
$$\mathrm{Tr}\big(\widehat S^{-1}\big) - \frac{1}{2}\,\mathrm{Tr}\big(\widehat S^{-1}\Sigma\,\widehat S^{-1}\big)\;.^1$$
This is obviously equivalent to minimizing with respect to the rotation-equivariant estimator $\widehat S$ the shifted loss function
$$-\,\mathrm{Tr}\big(\widehat S^{-1}\big) + \frac{1}{2}\,\mathrm{Tr}\big(\widehat S^{-1}\Sigma\,\widehat S^{-1}\big)\;.$$
We immediately recognize the Weighted Frobenius loss function applied to the precision matrix, up to some multiplicative renormalizations. This approach nicely dovetails with the Minimum Variance loss function of Engle et al. (2019), as it yields the same optimal nonlinear shrinkage formula, but pins down the scaling factor internally rather than by appealing to the external argument of trace preservation.
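A minimal numerical sketch of the mean-variance solution above (our own illustration; the matrix Σ, vector µ, and risk aversion ρ are arbitrary placeholders): the weight vector $w = \Sigma^{-1}\mu/\rho$ satisfies the first-order condition and maximizes the quadratic utility.

```python
import numpy as np

rng = np.random.default_rng(1)

def quadratic_utility(w, mu, Sigma, rho):
    """U(w) = mu'w - (rho/2) w' Sigma w (Markowitz mean-variance)."""
    return mu @ w - 0.5 * rho * w @ Sigma @ w

p, rho = 5, 2.0
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)            # a positive definite "truth"
mu = rng.standard_normal(p)

w_star = np.linalg.solve(Sigma, mu) / rho  # solves mu - rho Sigma w = 0

# The first-order condition holds, and no perturbation improves utility
# (the objective is strictly concave):
assert np.allclose(mu - rho * Sigma @ w_star, 0.0)
for _ in range(100):
    w = w_star + 0.1 * rng.standard_normal(p)
    assert quadratic_utility(w, mu, Sigma, rho) <= \
           quadratic_utility(w_star, mu, Sigma, rho)
```

In practice Σ is replaced by an estimator, which is exactly where the choice of shrinkage formula enters.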
1 Note the close connection with the Minimum Variance loss function, which was essentially based on $\mathrm{Tr}\big(\widehat S^{-1}\Sigma\,\widehat S^{-1}\big)\,/\,\mathrm{Tr}\big(\widehat S^{-1}\big)^2$. So, instead of dividing, we are subtracting here. Given that both of them are based on mean-variance portfolio optimization, it is reassuring to observe that they do not contradict each other.

B Finite-Sample Optimal Estimators for Various Losses
In this section, all loss functions are normalized by dimension so they admit an almost sure limit under large-dimensional asymptotics. Given that the objective is to optimize over the rotation-equivariant covariance matrix estimator $\widehat S$, whose eigenvectors are constrained to be equal to those of the sample covariance matrix $S$, we call "constant" any quantity that does not depend on the eigenvalues of $\widehat S$. In addition, for the eight loss functions that overlap with Table 2 of Donoho et al. (2018), we indicate how to partial out the incremental contribution to loss attributable to imperfect shrinkage of the largest eigenvalue, in order to be able to focus on the spike in some of the simulations in Section 6.
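As a concrete instance of the finite-sample analysis, under the plain Frobenius loss the optimal shrunken eigenvalues given the sample eigenvectors are $d_i = u_i' \Sigma u_i$; the following sketch (our own illustration, with a diagonal Σ chosen for convenience) verifies numerically that this choice beats the raw sample eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(7)
p, n = 20, 40
tau = np.linspace(1.0, 10.0, p)                # population eigenvalues
X = rng.standard_normal((n, p)) * np.sqrt(tau)
S = X.T @ X / n
lam, U = np.linalg.eigh(S)                     # sample spectrum

# FSOPT under Frobenius loss: keep the sample eigenvectors u_i and
# pick the shrunken eigenvalues minimizing ||S_tilde - Sigma||_F^2,
# i.e. d_i = u_i' Sigma u_i (Sigma is diagonal here w.l.o.g.).
Sigma = np.diag(tau)
d_fsopt = np.einsum('ji,jk,ki->i', U, Sigma, U)  # u_i' Sigma u_i

def frob_loss(d):
    return np.linalg.norm((U * d) @ U.T - Sigma) ** 2

# Any other choice of shrunken eigenvalues does worse, e.g. the raw
# sample eigenvalues:
assert frob_loss(d_fsopt) <= frob_loss(lam) + 1e-9
```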

B.2 Inverse Stein
From Equation (B.9), we can see that the incremental contribution to Inverse Stein loss due to imperfect shrinkage of the largest eigenvalue is equal to

B.3 Minimum Variance
where the scalar is independent of i = 1, . . . , p. A cursory inspection of the Minimum Variance loss function $L^{MV}(\widehat S, \Sigma)$ immediately reveals that the scalar cannot be determined internally, because multiplying the estimator $\widehat S$ by any strictly positive scalar simply washes out. Therefore, we have to invoke other arguments to make a choice. By preservation of the trace, the scalar should be set equal to one.

B.4 Stein
From Equation (B.24), we can see that the incremental contribution to Stein's loss due to imperfect shrinkage of the largest eigenvalue is equal to

B.5 Inverse Frobenius
we can see that the incremental contribution to Inverse Frobenius loss due to imperfect shrinkage of the largest eigenvalue is equal to

B.6 Symmetrized Stein
From Equation (B.34), we can see that the incremental contribution to Symmetrized Stein loss due to imperfect shrinkage of the largest eigenvalue is equal to the expression in (B.38). Note that the last term, which is a constant in the sense that it does not depend on the shrunken eigenvalue $d_p$, has been artificially added so that (B.38) is equal to zero when Equation (B.37) is satisfied, and strictly positive otherwise.

B.7 Weighted Frobenius
which is clearly minimized when $d_i = \exp\!\big(u_i' \log(\Sigma)\, u_i\big)$ for all $i = 1, \ldots, p$.

B.10 Fréchet
The first-order condition (FOC) is obtained as follows:

B.11 Quadratic
From Equation (B.63), we can see that the incremental contribution to Quadratic loss due to imperfect shrinkage of the largest eigenvalue is equal to the expression in (B.66). As with the Symmetrized Stein loss, we added a term so that (B.66) is zero for the FSOPT.
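The minimizer $d_i = \exp(u_i' \log(\Sigma)\, u_i)$ displayed above can be checked numerically; a minimal sketch (our own illustration), assuming the Log-Euclidian objective $\|\log \widetilde S - \log \Sigma\|_F^2$ and a diagonal Σ, which is without loss of generality under rotation equivariance:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 15, 30
tau = np.linspace(0.5, 8.0, p)                 # population eigenvalues
X = rng.standard_normal((n, p)) * np.sqrt(tau)
S = X.T @ X / n
lam, U = np.linalg.eigh(S)                     # sample spectrum

log_Sigma = np.diag(np.log(tau))  # matrix logarithm of a diagonal Sigma

# Candidate FSOPT shrunken eigenvalues: d_i = exp(u_i' log(Sigma) u_i).
d = np.exp(np.einsum('ji,jk,ki->i', U, log_Sigma, U))

def le_loss(dvals):
    """||log(S_tilde) - log(Sigma)||_F^2 for S_tilde = sum d_i u_i u_i'."""
    return np.linalg.norm((U * np.log(dvals)) @ U.T - log_Sigma) ** 2

assert np.all(d > 0)
assert le_loss(d) <= le_loss(lam) + 1e-9   # beats the sample eigenvalues
```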

B.12 Inverse Quadratic
The algebraic manipulations are the same as above, except that Σ −1 becomes Σ and S becomes S −1 , so there is no point repeating the intermediary steps.

B.13 Generalized Frobenius
To defuse a well-known source of confusion, we use the notation $\gamma^{-1}(x)$ to signify the inverse function of the invertible function γ, and $\gamma(x)^{-1}$ to signify one divided by γ(x).

B.14 Generalized Kullback-Leibler Divergence

B.15 Summary of Spike Contributions to Loss
To wrap up the finite-sample analysis, Table B.1 isolates the contribution of the largest eigenvalue to the overall loss, for the eight loss functions studied in Section 6.5. These have been standardized so that the FSOPT does not contribute to loss, by construction.


C.1 Quadratic
Proposition C.1. Under Assumptions 2-5,

Proof. For simplicity, let us assume that the support of F is a single compact interval [a, b] ⊂ (0, +∞); the generalization to the case κ > 1 is trivial. From Appendix B.11 we have: where $\Theta_n$ is the random bivariate function from Equation (3.4). By applying the technique from the proof of Theorem 3.1 of Ledoit and Wolf (2018), and by using Theorem 3.2 to handle the function $\Theta_n$, it follows that where, as per Equation (3.6),

Proposition C.1 allows us to characterize the asymptotically optimal nonlinear shrinkage function under Quadratic loss. (C.7)

Proof. If we fix x ∈ Supp(F), then the marginal contribution of ϕ(x) to the almost sure (nonrandom) limit of the loss function $L^Q_n(\Sigma_n, S_n)$ is

(C.17)
Proof. If we fix x ∈ Supp(F), then the marginal contribution of ϕ(x) to the almost sure (nonrandom) limit of the loss function $L^{QINV}_n$ is given by (C.18). The partial derivative of (C.18) with respect to ϕ(x) is (C.19). Setting it to zero, the first-order condition yields the solution (C.21). The proof of Theorem 4.4 is concluded as before: To the unobservable quantity c corresponds the plug-in estimator p/n; to the unobservable quantity H(t) corresponds the plug-in estimator $H_n(t) := \frac{1}{p}\sum_{j=1}^{p} \mathbf{1}_{[\tau_{n,j}, +\infty)}(t)$; and to the unobservable quantity θ(x) corresponds the plug-in estimator $\theta_n(x, t)$ from Equation (3.7). The fact that these three unobservable quantities can be replaced with their respective plug-in counterparts at no loss asymptotically is established in the same way as in the proof of Ledoit and Wolf's (2018) Theorem 5.2.

D Demeaning
To simplify the notation, and in line with the related literature, we assume throughout the paper that all variables have mean zero. In many applications, variables do not have mean zero, or at least it is not known whether they do. In such a setting, it is more common to base the sample covariance matrix on the demeaned data instead: $S_n := \overline{Y}_n' \overline{Y}_n/(n-1)$, where $\overline{Y}_n$ is obtained from $Y_n$ by the operation of columnwise demeaning. In this case, n needs to be replaced everywhere with the 'effective' sample size n − 1; see Ledoit and Wolf (2020b, Section 6).
As shown at the beginning of Section 3 of Silverstein and Bai (1995), demeaning is a rank-one perturbation, at least for i.i.d. data as considered in our context, which in turn, thanks to Lemma 2.5a of the same paper, implies that it has no impact on large-dimensional asymptotic convergence results.
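Both points — demeaning as a rank-one perturbation and the use of the effective sample size n − 1 — can be illustrated in a few lines (our own sketch, with an arbitrary data matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 8
Y = rng.standard_normal((n, p)) + 3.0      # data with nonzero means

Y_bar = Y - Y.mean(axis=0, keepdims=True)  # columnwise demeaning
S = Y_bar.T @ Y_bar / (n - 1)              # effective sample size n - 1

# Demeaning is a rank-one perturbation of the data matrix:
assert np.linalg.matrix_rank(Y - Y_bar) == 1
# and the result is the usual unbiased sample covariance matrix:
assert np.allclose(S, np.cov(Y, rowvar=False))
```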
On the other hand, Pan (2014) shows that central limit theorems of sample eigenvalue statistics are different under large-dimensional asymptotics when demeaning is used versus not used. This could at least cause some concern that the finite-sample performance of our estimators might be affected by demeaning. To address this concern, we repeat the Monte Carlo exercise of Section 6.2, using the same random seed, but now base the sample covariance matrix on the demeaned data instead, in conjunction with replacing n everywhere with the 'effective' sample size n − 1. The results are presented in Table D.1, and it can be seen that they are not meaningfully different from the results in Table 6.1.

Table D.1: Average losses computed for various estimators when p = 100 and n = 200, with demeaning and adjustment to the effective sample size when computing the sample covariance matrix. Best numbers are in bold face. The random seed is the same as the one used for computing the numbers in Table 6.1.