The small-n-large-P situation has become common in genetics research, medical studies, risk management, and other fields. Feature selection is crucial in these studies, yet it poses a serious challenge. Traditional criteria such as AIC, BIC, and cross-validation tend to select too many features. In this paper, we examine the feature selection problem under generalized linear models, studying the extended Bayes information criterion (EBIC), whose prior on the model space takes specific account of the small-n-large-P situation. The criterion is shown to be selection consistent under generalized linear models. We also report simulation results and a data analysis that illustrate the effectiveness of EBIC for feature selection.
Keywords and phrases: Consistency, exponential family, extended Bayes information criterion, feature selection, generalized linear model, small-n-large-P.
In many scientific investigations, researchers explore the relationship between a response variable and a set of explanatory features through a random sample. Examples of such features include disease genes and quantitative trait loci in the human genome, biomarkers responsible for disease pathways, and stocks generating profits in investment portfolios. The selection of causal features is a crucial aspect of such investigations. When the sample size n is relatively small but the number of features P under consideration is extremely large, the selection of causal features poses a serious challenge. Feature selection, in the sense of identifying causal features, is different from, but often interwoven with, model selection; the latter involves two operational components: a procedure for generating candidate models and a criterion for assessing them. In this article, we concentrate on the issue of model selection criteria.
Traditional model selection criteria such as Akaike’s information criterion (AIC) (Akaike (1973)), cross-validation (CV) (Stone (1974)), and generalized cross-validation (GCV) (Craven and Wahba (1979)) essentially address the prediction accuracy of selected models. The popular Bayes information criterion (BIC) (Schwarz (1978)) was developed from the Bayesian paradigm in a different vein: BIC approximates the posterior model probability when the prior is uniform on the model space. However, in the small-n-large-P situation, these criteria become overly liberal and fail to serve the purpose of feature selection. This phenomenon has been observed by Broman and Speed (2002), Siegmund (2004), and Bogdan, Doerge and Ghosh (2004) in genetic studies. See also Donoho (2000), Singh et al. (2002), Marchini, Donnelly, and Cardon (2005), Clayton et al. (2005), Fan and Li (2006), Zhang and Huang (2008), and Hoh, Wille, and Ott (2008). Some recent BIC-related model selection procedures in new situations can be found in Wang, Li, and Tsai (2007), Jiang et al. (2008), and many others.
Recently, Chen and Chen (2008) pointed out that the uniform prior on the model space is the cause of BIC’s liberality in the small-n-large-P situation. Correction of this problem leads to a family of extended Bayes information criteria (EBIC). Bogdan, Doerge and Ghosh (2004) made the same observation but provided slightly different correction measures. Mathematically, the EBIC is the classical BIC with an additional penalty term 2γ log P for some positive γ. Interestingly, Foster and George (1994) found that, instead of adding the log P term, simply replacing log n with 2 log P in BIC gives empirically optimal results in view of risk inflation, and their finding was echoed in Abramovich et al. (2006). The EBIC has been shown to be selection consistent in the small-n-large-P framework under the normal linear model. Its validity under a wider class of regression models has remained an unsolved problem. In this paper, we develop technical results for exponential family distributions that are of interest in themselves. They are particularly useful in proving the uniform consistency of the maximum likelihood estimates of the coefficients in the linear predictor of all generalized linear models (GLM) containing the causal features (Theorem 1), and the selection consistency of EBIC under GLM with canonical links (Theorem 2). In the implementation, we need to place an upper bound K on the number of causal features. If K is chosen too small in an application, the selection consistency of the procedure with EBIC may not be realized. To address this issue, we show that if K is chosen too small, the EBIC selects a model exhausting all K features; if the selected model exhausts all K features allocated, reanalyzing the data with a larger K is suggested.
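To fix notation, one common form of the criterion (a paraphrase of the definition in Chen and Chen (2008); the symbols ν(s) for the size of model s and L(θ̂_s) for its maximized likelihood are ours) is

```latex
\mathrm{EBIC}_{\gamma}(s) \;=\; -2\log L(\hat{\theta}_s) \;+\; \nu(s)\log n \;+\; 2\gamma \log \binom{P}{\nu(s)}, \qquad \gamma \ge 0 .
```

Since log C(P, ν(s)) ≈ ν(s) log P when ν(s) ≪ P, each selected feature effectively incurs the extra 2γ log P penalty described above, and γ = 0 recovers the classical BIC.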
We have also investigated the performance of EBIC by simulation under the logistic regression model and the Poisson log-linear model. The logistic regression model is valid in both prospective and retrospective studies, see McCullagh and Nelder (1989, Chap. 4), and is a major approach in genetic research, see for example The Wellcome Trust Case-Control Consortium (2007). In principle, EBIC is an all-subsets method, which is computationally infeasible when P is large. Our implementation strategy for EBIC follows that of Wang, Li, and Tsai (2007) and Zhang, Li, and Tsai (2010). We use regularization methods such as the LASSO (Tibshirani (1996)), SCAD (Fan and Li (2001)), or the Elastic Net (Zou and Hastie (2005)) to obtain regression models with various levels of sparsity. Because these methods only determine the order of the penalty level required for selection consistency, some cross-validation procedure is ultimately used to select the final model. Replacing the computer-intensive cross-validation procedure with EBIC creates a promising new approach. In the simulations, we used the R packages (R Development Core Team (2010)) glmpath (Park and Hastie (2007)) and glmnet (Friedman, Hastie, and Tibshirani (2010)), designed for the LASSO and Elastic Net.
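As a concrete illustration of how the criterion trades goodness of fit against model size, the following sketch (in Python rather than R, purely for illustration; the helper names `ebic` and `log_binom` and all numerical values are ours) scores two hypothetical candidate models:

```python
from math import lgamma, log

def log_binom(P, k):
    # log of the binomial coefficient C(P, k), via log-gamma
    return lgamma(P + 1) - lgamma(k + 1) - lgamma(P - k + 1)

def ebic(loglik, k, n, P, gamma=1.0):
    # EBIC_gamma = -2 * log-likelihood + k log n + 2 gamma log C(P, k);
    # gamma = 0 recovers the classical BIC
    return -2.0 * loglik + k * log(n) + 2.0 * gamma * log_binom(P, k)

n, P = 100, 1000
# a sparse model (2 features) fitting slightly worse than a dense one (10 features)
sparse = ebic(-60.0, 2, n, P)
dense = ebic(-40.0, 10, n, P)
bic_sparse = ebic(-60.0, 2, n, P, gamma=0.0)
bic_dense = ebic(-40.0, 10, n, P, gamma=0.0)
print(bic_dense < bic_sparse)  # BIC prefers the dense model here
print(sparse < dense)          # EBIC with gamma = 1 prefers the sparse one
```

The example mirrors the liberality discussed above: with P ≫ n the classical BIC penalty k log n is too weak, while the extra 2γ log C(P, k) term restores a preference for sparse models.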
The remainder of the paper is arranged as follows. In Section 2 the GLM is briefly reviewed and its properties in the small-n-large-P framework are investigated. In Section 3, EBIC for GLM is introduced and its consistency is established. Simulation studies are reported in Section 4. A data example is analyzed in Section 5, and technical proofs and other supporting information are collected in an Appendix.
Let Y be a response variable and x a vector of feature variables (hereafter, for convenience, the variables are called features). A GLM consists of three components. The first component is an exponential family distribution assumed for Y, with density function

f(y; θ) = exp{θ^τ y − b(θ)}   (2.1)

with respect to a σ-finite measure ν. The parameter θ is called the natural parameter, and the set

Θ = {θ : ∫ exp{θ^τ y} dν < ∞}

is called the natural parameter space. The exponential family has the following properties:
(a) the natural parameter space Θ is convex;
(b) at any interior point of Θ, b(θ) has derivatives of all orders, and b′(θ) = E(Y) ≡ µ, b′′(θ) = Var(Y) ≡ σ²;
(c) at any interior point of Θ, the moment generating function of the family exists and is given by M(t) = exp{b(θ + t) − b(θ)}.

The second component of the GLM is a linear predictor given by η = x^τ β; that is, the GLM assumes that the features affect the distribution of Y through this linear form. The third component of the GLM is a link function g that relates the mean µ to the linear predictor by g(µ) = η = x^τ β.
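For instance (our own worked example, not from the paper): for the Bernoulli family, b(θ) = log(1 + e^θ), the canonical link is the logit, and property (b) can be checked numerically by central differences:

```python
from math import log1p, exp

def b(theta):
    # cumulant function of the Bernoulli family: b(theta) = log(1 + e^theta)
    return log1p(exp(theta))

theta, h = 0.7, 1e-5
# central-difference approximations to b'(theta) and b''(theta)
b1 = (b(theta + h) - b(theta - h)) / (2 * h)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2

mu = 1.0 / (1.0 + exp(-theta))         # Bernoulli mean; theta = logit(mu)
assert abs(b1 - mu) < 1e-6             # b'(theta) = E(Y) = mu
assert abs(b2 - mu * (1 - mu)) < 1e-3  # b''(theta) = Var(Y) = mu(1 - mu)
```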
We investigate the feature selection problem given a random sample {(yi, xi) : i = 1, . . . , n} with two characteristics: (i) small-n-large-P, the number of features is much larger than the sample size; and (ii) sparsity, only a few unidentified features affect Y. We refer to a GLM with these two characteristics as the small-n-large-P sparse GLM.
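Data with these two characteristics can be mimicked in a few lines; the sketch below (an illustrative Python/NumPy toy with arbitrary sizes, not the simulation design of Section 4) generates logistic responses driven by only k of P features:

```python
import numpy as np

rng = np.random.default_rng(1)
n, P, k = 100, 2000, 5        # small-n-large-P (P >> n) with sparsity (k causal features)
beta = np.zeros(P)
beta[:k] = 1.0                # for illustration, the causal features are the first k
X = rng.standard_normal((n, P))
eta = X @ beta                # linear predictor eta = x^tau beta
prob = 1.0 / (1.0 + np.exp(-eta))  # canonical logistic link
y = rng.binomial(1, prob)     # Bernoulli responses; only k of the P features affect y
```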