STAT0030 Assessment 2 — Instructions
For this assessment you should submit your work online, on the course Moodle page, using the link “ICA2: Click here to submit your assignment”. Make sure that none of the files contains your surname, as the marking must be anonymous. You must submit two files:
• An electronic copy of your StudentNumber.rmd file, containing your R Markdown code. For example, if your student number is 18239004, your R Markdown script should be saved in the file 18239004.rmd.
• A single PDF file named StudentNumber.pdf containing the knitted output of the R Markdown file. This should correspond exactly to what is produced when knitting the submitted .rmd file.
All output in your PDF should be clearly presented and structured according to the question parts. Your report (including graphics but excluding any hidden code) should not exceed 5 pages.
STAT0030 Assessment 2 – Marking guidelines
The assessment is marked out of 40. The marks are roughly subdivided into the following components.
1. Exploratory analysis (5 marks): investigation and commentary of initial statistical properties, relationships, and anything of note which helps justify your choice of graphs and modelling strategy.
2. Graphical presentation (5 marks): appropriate choice of graphs and formatting.
3. Modelling strategy (10 marks): marks here will be based on a structured, justified, well-principled approach with clear and concise discussion.
4. Interpretation of final model (10 marks): comparison of the two final models and commentary on their quality.
5. Quality of the code (10 marks): your code should be clean, readable (with sufficient commenting for the user) and efficient.
STAT0030 Assessment 2 — Questions
1 Introduction to Ridge Regression
In Lab 4 we found out how to fit linear models in R. Recall that linear models take the form
$$y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i, \qquad i = 1, \ldots, n.$$
Here,
• $y_i$ is the value of the response (or dependent) variable for the $i$th case in the dataset.
• $x_{ij}$ is the value of the $j$th explanatory variable or covariate for that case.
• $\beta_0, \ldots, \beta_p$ are parameters.
• $\epsilon_1, \ldots, \epsilon_n$ are independent error terms with zero mean, assumed to have constant variance and to be normally distributed (unless otherwise stated).

The coefficients $\beta_0, \beta_1, \ldots, \beta_p$ are usually estimated by minimising the residual sum of squares,
$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip}\right)^2.$$
The resulting estimator is called the least squares estimator, which also happens to be the maximum likelihood estimator when the errors are assumed to be normally distributed with variance $\sigma^2$.
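For concreteness, a minimal sketch of a least squares fit in R (using a small simulated data frame with hypothetical names `dat`, `x1`, `x2` and `y`, not the assignment data) might look like the following:

```r
# Hypothetical example data: y depends linearly on x1 and x2 plus noise
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(50)

# lm() estimates the coefficients by minimising the residual sum of squares
fit <- lm(y ~ x1 + x2, data = dat)
coef(fit)     # estimated beta_0, beta_1, beta_2
summary(fit)  # standard errors, t-tests, R^2
```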
However, the least squares estimator can overfit, especially when a large number of covariates are used. This means that the linear fit will pick up random noise in the observed data and will not be a good predictor of future observations. The simplest way of dealing with this issue is best subset selection, where every possible subset of the covariates is used to fit a linear model and the fits are compared in terms of some model selection criterion (for example, the Akaike Information Criterion, or predictive power assessed through cross-validation). Unfortunately, best subset selection is computationally prohibitive for large numbers of covariates. Instead, stepwise regression is often used, where covariates are iteratively added to or removed from the model according to their p-values. However, stepwise regression is sensitive to the order in which covariates are added or removed and is not guaranteed to find the best overall subset of covariates.

An alternative approach to this problem is penalised regression. In penalised regression, the objective function adds to the residual sum of squares a penalty term which represents a “cost” for large values of the regression coefficients. The simplest form of penalisation is the L2 norm of the coefficient vector, also called the ridge penalty. The loss function in ridge regression is then given by
$$L_{\mathrm{ridge}} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$
which is minimised with respect to the coefficient vector $\beta$ to obtain penalised parameter estimates. Here, $\lambda$ is a tuning parameter which represents the level of regularisation: $\lambda = 0$ corresponds to no regularisation (giving the standard least squares estimators), whereas $\lambda = \infty$ corresponds to total regularisation, i.e., all coefficients except $\beta_0$ are forced to 0. The optimal value of $\lambda$ is typically chosen by setting up a grid of $s$ values $\lambda_1, \ldots, \lambda_s$ and computing the cross-validated performance (for example, in terms of mean squared error) of the model fit resulting from each value of $\lambda$. The ridge fit is sensitive to the scale of each covariate, so the covariates are usually scaled before the model is fitted.

One key advantage of the ridge penalty is that the optimisation of the loss function remains convex and is therefore computationally simple. One disadvantage is that ridge regression always includes all covariates in the model, so it does not perform any variable selection.

Your task for this assignment will be to write R code to compute ridge regression coefficient estimates for a given dataset (see details below), and to apply it to a dataset of Covid-19 case numbers from the UK’s first pandemic wave (see details below).
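As a rough illustration only (not the required implementation), ridge estimates for a fixed $\lambda$ can be obtained from the closed-form solution $\hat\beta_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$ applied to standardised covariates, with the intercept left unpenalised. The sketch below assumes a numeric covariate matrix `X` and response vector `y`; the function name `ridge_coef` and the choice of scaling convention are illustrative assumptions.

```r
# Illustrative ridge regression sketch (one possible convention; assumes a
# numeric covariate matrix X and response vector y, intercept unpenalised).
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)            # standardise covariates (mean 0, sd 1)
  yc <- y - mean(y)         # centre the response
  p  <- ncol(Xs)
  # Closed-form ridge solution on the standardised scale:
  # beta_hat = (X'X + lambda * I)^(-1) X'y
  beta_s <- solve(crossprod(Xs) + lambda * diag(p), crossprod(Xs, yc))
  # Transform coefficients back to the original covariate scale
  beta <- drop(beta_s) / attr(Xs, "scaled:scale")
  b0   <- mean(y) - sum(beta * attr(Xs, "scaled:center"))
  c(intercept = b0, beta)
}

# Example: compare estimates over a grid of lambda values
# lambdas <- c(0, 0.1, 1, 10, 100)
# sapply(lambdas, function(l) ridge_coef(X, y, l))
```

In practice, packages such as glmnet (with `alpha = 0`) are often used to fit ridge regressions and to choose $\lambda$ by cross-validation over a grid of values, but the exercise here is to implement the computation yourself.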