The small-n-large-P situation has become common in genetics research, medical studies, risk management, and other fields. Feature selection is crucial in these studies, yet it poses a serious challenge. Traditional criteria such as AIC, BIC, and cross-validation tend to select too many features. In this paper, we examine the feature selection problem under generalized linear models, studying the extended Bayes information criterion (EBIC), whose prior on the model space takes specific account of the small-n-large-P situation. The criterion is shown to be selection consistent under generalized linear models. We also report simulation results and a data analysis that illustrate the effectiveness of EBIC for feature selection.
Keywords and phrases: Consistency, exponential family, extended Bayes information criterion, feature selection, generalized linear model, small-n-large-P.
In many scientific investigations, researchers explore the relationship between a response variable and a set of explanatory features through a random sample. Examples of such features include disease genes and quantitative trait loci in the human genome, biomarkers responsible for disease pathways, and stocks generating profits in investment portfolios. The selection of causal features is a crucial aspect of such investigations. When the sample size n is relatively small but the number of features P under consideration is extremely large, the selection of causal features poses a serious challenge. Feature selection, in the sense of identifying causal features, is different from, but often interwoven with, model selection; the latter involves two operational components: a procedure for generating candidate models and a criterion for assessing them. In this article, we concentrate on the issue of model selection criteria.
Traditional model selection criteria such as Akaike’s information criterion (AIC) (Akaike (1973)), cross-validation (CV) (Stone (1974)), and generalized cross-validation (GCV) (Craven and Wahba (1979)) essentially address the prediction accuracy of selected models. The popular Bayes information criterion (BIC) (Schwarz (1978)) was developed from the Bayesian paradigm in a different vein: BIC approximates the posterior model probability when the prior is uniform on the model space. However, in the small-n-large-P situation, these criteria become overly liberal and fail to serve the purpose of feature selection. This phenomenon has been observed by Broman and Speed (2002), Siegmund (2004), and Bogdan, Doerge and Ghosh (2004) in genetic studies. See also Donoho (2000), Singh et al. (2002), Marchini, Donnelly, and Cardon (2005), Clayton et al. (2005), Fan and Li (2006), Zhang and Huang (2008), and Hoh, Wille, and Ott (2008). Some recent BIC-related model selection procedures in new situations can be found in Wang, Li, and Tsai (2007), Jiang et al. (2008), and many others.
Recently, Chen and Chen (2008) pointed out that the uniform prior on the model space is the cause of BIC’s liberality in the small-n-large-P situation. Correction of this problem leads to a family of extended Bayes information criteria (EBIC). Bogdan, Doerge and Ghosh (2004) made the same observation but provided slightly different correction measures. Mathematically, the EBIC is the classical BIC with an additional penalty term 2γ log P for some positive γ. Interestingly, Foster and George (1994) found that, instead of adding the log P term, simply replacing log n with 2 log P in BIC gives empirically optimal results in view of risk inflation, and their finding was echoed in Abramovich et al. (2006). The EBIC has been shown to be selection consistent in the small-n-large-P framework under the normal linear model. Its validity under a wider class of regression models has remained an unsolved problem. In this paper, we develop technical results for exponential family distributions that are of interest in themselves. They are particularly useful in proving the uniform consistency of the maximum likelihood estimates of the coefficients in the linear predictor of all generalized linear models (GLM) containing the causal features (Theorem 1), and the selection consistency of EBIC under GLM with canonical links (Theorem 2). In the implementation, we need to place an upper bound K on the number of causal features. If K is chosen too small in an application, the selection consistency of the procedure with EBIC may not be realized. To address this issue, we show that if K is chosen too small, the EBIC selects a model exhausting all K features; if the selected model exhausts all K features allocated, reanalyzing the data with a larger K is suggested.
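To fix notation, one common form of the criterion (a paraphrase of the definition in Chen and Chen (2008); the symbols ν(s) for the size of model s and L(θ̂_s) for its maximized likelihood are ours) is

```latex
\mathrm{EBIC}_{\gamma}(s) \;=\; -2\log L(\hat{\theta}_s) \;+\; \nu(s)\log n \;+\; 2\gamma \log \binom{P}{\nu(s)}, \qquad \gamma \ge 0 .
```

Since log C(P, ν(s)) ≈ ν(s) log P when ν(s) ≪ P, each selected feature effectively incurs the extra 2γ log P penalty described above, and γ = 0 recovers the classical BIC.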
We have also investigated the performance of EBIC by simulation under the logistic regression model and the Poisson log-linear model. The logistic regression model is valid in both prospective and retrospective studies, see McCullagh and Nelder (1989, Chap. 4), and is a major approach in genetic research, see for example The Wellcome Trust Case-Control Consortium (2007). In principle, EBIC is an all-subsets method, which is computationally infeasible when P is large. Our implementation strategy for EBIC follows that of Wang, Li, and Tsai (2007) and Zhang, Li, and Tsai (2010). We use regularization methods such as the LASSO (Tibshirani (1996)), SCAD (Fan and Li (2001)), or the Elastic Net (Zou and Hastie (2005)) to obtain regression models with various levels of sparsity. Because these methods only determine the order of the penalty level required for selection consistency, some cross-validation procedure is ultimately used to select the final model. Replacing the computer-intensive cross-validation procedure with EBIC creates a promising new approach. In the simulations, we used the R packages (R Development Core Team (2010)) glmpath (Park and Hastie (2007)) and glmnet (Friedman, Hastie, and Tibshirani (2010)), designed for the LASSO and Elastic Net.
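As a concrete illustration of how the criterion trades goodness of fit against model size, the following sketch (in Python rather than R, purely for illustration; the helper names `ebic` and `log_binom` and all numerical values are ours) scores two hypothetical candidate models:

```python
from math import lgamma, log

def log_binom(P, k):
    # log of the binomial coefficient C(P, k), via log-gamma
    return lgamma(P + 1) - lgamma(k + 1) - lgamma(P - k + 1)

def ebic(loglik, k, n, P, gamma=1.0):
    # EBIC_gamma = -2 * log-likelihood + k log n + 2 gamma log C(P, k);
    # gamma = 0 recovers the classical BIC
    return -2.0 * loglik + k * log(n) + 2.0 * gamma * log_binom(P, k)

n, P = 100, 1000
# a sparse model (2 features) fitting slightly worse than a dense one (10 features)
sparse = ebic(-60.0, 2, n, P)
dense = ebic(-40.0, 10, n, P)
bic_sparse = ebic(-60.0, 2, n, P, gamma=0.0)
bic_dense = ebic(-40.0, 10, n, P, gamma=0.0)
print(bic_dense < bic_sparse)  # BIC prefers the dense model here
print(sparse < dense)          # EBIC with gamma = 1 prefers the sparse one
```

The example mirrors the liberality discussed above: with P ≫ n the classical BIC penalty k log n is too weak, while the extra 2γ log C(P, k) term restores a preference for sparse models.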
The remainder of the paper is arranged as follows. In Section 2 the GLM is briefly reviewed and its properties in the small-n-large-P framework are investigated. In Section 3, EBIC for GLM is introduced and its consistency is established. Simulation studies are reported in Section 4. A data example is analyzed in Section 5, and technical proofs and other supporting information are collected in an Appendix.
Let Y be a response variable and x a vector of feature variables (hereafter, for convenience, the variables are called features). A GLM consists of three components. The first component is an exponential family distribution assumed for Y, with density function

f(y; θ) = exp{θ^τ y − b(θ)}   (2.1)

with respect to a σ-finite measure ν. The parameter θ is called the natural parameter, and the set

Θ = {θ : ∫ exp{θ^τ y} dν < ∞}

is called the natural parameter space. The exponential family has the following properties:
(a) the natural parameter space Θ is convex;
(b) at any interior point of Θ, b(θ) has derivatives of all orders, and b′(θ) = E(Y) ≡ µ, b′′(θ) = Var(Y) ≡ σ²;
(c) at any interior point of Θ, the moment generating function of the family exists and is given by M(t) = exp{b(θ + t) − b(θ)}.

The second component of the GLM is a linear predictor given by η = x^τ β; that is, the GLM assumes that the features affect the distribution of Y through this linear form. The third component of the GLM is a link function g that relates the mean µ to the linear predictor by g(µ) = η = x^τ β.
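For instance (our own worked example, not from the paper): for the Bernoulli family, b(θ) = log(1 + e^θ), the canonical link is the logit, and property (b) can be checked numerically by central differences:

```python
from math import log1p, exp

def b(theta):
    # cumulant function of the Bernoulli family: b(theta) = log(1 + e^theta)
    return log1p(exp(theta))

theta, h = 0.7, 1e-5
# central-difference approximations to b'(theta) and b''(theta)
b1 = (b(theta + h) - b(theta - h)) / (2 * h)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h ** 2

mu = 1.0 / (1.0 + exp(-theta))         # Bernoulli mean; theta = logit(mu)
assert abs(b1 - mu) < 1e-6             # b'(theta) = E(Y) = mu
assert abs(b2 - mu * (1 - mu)) < 1e-3  # b''(theta) = Var(Y) = mu(1 - mu)
```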
We investigate the feature selection problem given a random sample {(yi, xi) : i = 1, . . . , n} with two characteristics: (i) small-n-large-P, the number of features is much larger than the sample size; and (ii) sparsity, only a few unidentified features affect Y. We refer to a GLM with these two characteristics as the small-n-large-P sparse GLM.
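Data with these two characteristics can be mimicked in a few lines; the sketch below (an illustrative Python/NumPy toy with arbitrary sizes, not the simulation design of Section 4) generates logistic responses driven by only k of P features:

```python
import numpy as np

rng = np.random.default_rng(1)
n, P, k = 100, 2000, 5        # small-n-large-P (P >> n) with sparsity (k causal features)
beta = np.zeros(P)
beta[:k] = 1.0                # for illustration, the causal features are the first k
X = rng.standard_normal((n, P))
eta = X @ beta                # linear predictor eta = x^tau beta
prob = 1.0 / (1.0 + np.exp(-eta))  # canonical logistic link
y = rng.binomial(1, prob)     # Bernoulli responses; only k of the P features affect y
```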