Assignment 2: Modeling, Uncertainties and Feature Importance
M Loecher
source("utils.R")
1 Introductory Comments
This assignment serves two purposes. First, it trains your modeling and data manipulation skills, in particular linear regression and summarizing data the dplyr way. Second, it reinforces your competencies in computational statistics (understanding statistics by simulation) by comparing theoretical standard errors to those obtained by simulation. As a bonus, you get to explore a cool data set and learn about feature importance.
You should work in groups (ideally about 3 students per group). Each group must submit at least one R file containing well-documented functions and test cases for those functions. You may use two files (one for the functions and another for the test cases), but this is not necessary. Write your answers and explanations as comments in the R file. We strongly encourage you to submit an Rmd file (plus its compiled version) instead of an R file.
2 Data
Download the Bike Sharing Dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset), read in the day.csv data, and preprocess it as follows.
bikes <- read.csv("Bike-Sharing-Dataset/day.csv", stringsAsFactors = FALSE)
#bikes$days_since_2011 = as.numeric(as.Date(bikes$dteday)-as.Date("2011-01-01"))
bike.features.of.interest = c("season","holiday","workingday", "weathersit","temp", "hum", "windspeed", "days_since_2011", "cnt") # colnames(bike)[c(1,4,6,7,8,9,10,12)]
bikes = clean.bike.data(bikes)[,bike.features.of.interest]
options(digits=2)
#datatable(bikes[1:50,c(bike.features.of.interest, "cnt")])
kable(bikes[1:5,])
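| season | holiday | workingday | weathersit | temp | hum | windspeed | days_since_2011 | cnt |
|:-------|:-----------|:---------------|:-----------|-----:|----:|----------:|----------------:|-----:|
| WINTER | NO HOLIDAY | NO WORKING DAY | MISTY | 8.2 | 81 | 11 | 0 | 985 |
| WINTER | NO HOLIDAY | NO WORKING DAY | MISTY | 9.1 | 70 | 17 | 1 | 801 |
| WINTER | NO HOLIDAY | WORKING DAY | GOOD | 1.2 | 44 | 17 | 2 | 1349 |
| WINTER | NO HOLIDAY | WORKING DAY | GOOD | 1.4 | 59 | 11 | 3 | 1562 |
| WINTER | NO HOLIDAY | WORKING DAY | GOOD | 2.7 | 44 | 13 | 4 | 1600 |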
Create a random subset of the data for training, which leaves a “hold out data set” for testing:
set.seed(123)
nTrain = round(nrow(bikes)/2)
ranRows = sample(nrow(bikes), nTrain)
train = bikes[ranRows, ]
test = bikes[-ranRows, ]
3 Data Summaries
(2 points) Using dplyr (group_by(), summarise()) and ggplot2:
Compute the average bike rental counts as a function of:
weathersit
workingday
all combinations of weathersit and workingday
(2 points) Repeat the above using the function lm() only.
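As a hint for why lm() can reproduce grouped averages: with a single categorical predictor and no intercept, each coefficient equals the mean of its group, and with an interaction term and no intercept, each coefficient equals a cell mean. The following is a minimal sketch on the built-in mtcars data, not on the bike data; using cyl and am as stand-ins for weathersit and workingday is an assumption made purely for illustration.

```r
# Illustration on the built-in mtcars data (a stand-in, NOT the bike data).
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Reference: plain group means of mpg per cylinder class.
group_means <- tapply(cars$mpg, cars$cyl, mean)

# lm() without intercept: one coefficient per factor level = group mean.
fit_no_int <- lm(mpg ~ cyl - 1, data = cars)
print(coef(fit_no_int))

# lm() with intercept: the intercept is the mean of the reference level,
# the remaining coefficients are differences to that reference level.
fit_int <- lm(mpg ~ cyl, data = cars)
means_from_int <- coef(fit_int)[1] + c(0, coef(fit_int)[-1])

# All combinations of two factors: an interaction-only model without
# intercept yields one coefficient per (cyl, am) cell.
cell_means <- tapply(cars$mpg, list(cars$cyl, cars$am), mean)
fit_cells <- lm(mpg ~ cyl:am - 1, data = cars)
print(coef(fit_cells))
```

Dropping the intercept is a design choice that makes the coefficients directly readable as means; with the default intercept the same means can be recovered by adding the intercept to each difference, as in means_from_int.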
4 Standard Errors in Linear Regression
In this example, we use a linear regression model to predict the number of rented bikes on a particular day, given weather and calendar information. For the interpretation, we examine the estimated regression weights. The features are a mix of numerical and categorical variables. For each feature, the table shows the estimated weight, the standard error of the estimate (SE), and the absolute value of the t-statistic (|t|).
#data(bike)
mod = lm(cnt ~ ., data = train, x = TRUE)
lm_summary = summary(mod)$coefficients
lm_summary[,'t value'] = abs(lm_summary[,'t value'])
rownames(lm_summary) = pretty_rownames(rownames(lm_summary))
kable(lm_summary[,c('Estimate', 'Std. Error', 't value')], digits = 1, col.names = c('Weight', 'SE', "|t|"))
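The SE column reported by summary() follows from the textbook formula: the estimated covariance of the coefficients is sigma^2 (X'X)^{-1}, with sigma^2 estimated from the residuals. The sketch below compares that formula with a simulation-based estimate. It rests on assumptions that are not part of the assignment: the data are simulated with made-up coefficients, and the simulation is a nonparametric bootstrap (resampling rows), which may differ from the scheme intended for your solution.

```r
set.seed(1)

# Simulated data with hypothetical coefficients (NOT the bike data).
n <- 200
x1 <- rnorm(n)
x2 <- runif(n)
y <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 2)
dat <- data.frame(y, x1, x2)

fit <- lm(y ~ x1 + x2, data = dat)

# Theoretical SE: square root of the diagonal of sigma^2 * (X'X)^{-1}.
X <- model.matrix(fit)
sigma2 <- sum(residuals(fit)^2) / df.residual(fit)
se_theory <- sqrt(diag(sigma2 * solve(crossprod(X))))

# SE as reported by summary().
se_lm <- summary(fit)$coefficients[, "Std. Error"]

# Bootstrap SE: refit on resampled rows, then take the standard
# deviation of each coefficient across the B refits.
B <- 500
boot_coefs <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(y ~ x1 + x2, data = dat[idx, ]))
})
se_boot <- apply(boot_coefs, 1, sd)

print(round(cbind(theory = se_theory, lm = se_lm, bootstrap = se_boot), 3))
```

The first two columns should agree up to floating-point error, since summary() uses the same formula; the bootstrap column is a Monte Carlo estimate and will only be close.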