source("utils.R")

1 Introductory Comments

This assignment serves two purposes. First, it trains your modeling and data-manipulation skills, in particular linear regression and summarizing data the dplyr way. Second, it reinforces your computational-statistics competencies (understanding statistics by simulation) by comparing theoretical standard errors to those obtained by simulation. As a bonus, you get to explore a cool data set and learn about feature importance.

You should work in groups (ideally about 3 students per group). Each group must submit at least one R file containing well-documented functions and test cases for those functions. You may use two files (one for the functions, another for the test cases), but this is not necessary. Write your answers and explanations as comments in the R file. We strongly encourage you to submit an Rmd file (plus its compiled version) instead of a plain R file.

This home assignment is worth 16 points (= 20%) and is due Sunday, July 11, 6 PM.

2 Data

You need to download the Bike Sharing Dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset), read in the day.csv data, and preprocess it as follows.

bikes <- read.csv("Bike-Sharing-Dataset/day.csv", stringsAsFactors = FALSE)
# days_since_2011 is created during cleaning; if your clean.bike.data() does not
# add it, uncomment the next line:
# bikes$days_since_2011 = as.numeric(as.Date(bikes$dteday) - as.Date("2011-01-01"))
bike.features.of.interest = c("season", "holiday", "workingday", "weathersit",
                              "temp", "hum", "windspeed", "days_since_2011", "cnt")

bikes = clean.bike.data(bikes)[, bike.features.of.interest]  # clean.bike.data() comes from utils.R
options(digits = 2)
kable(bikes[1:5, ])  # kable() is from the knitr package
season  holiday     workingday      weathersit  temp  hum  windspeed  days_since_2011   cnt
WINTER  NO HOLIDAY  NO WORKING DAY  MISTY        8.2   81         11                0   985
WINTER  NO HOLIDAY  NO WORKING DAY  MISTY        9.1   70         17                1   801
WINTER  NO HOLIDAY  WORKING DAY     GOOD         1.2   44         17                2  1349
WINTER  NO HOLIDAY  WORKING DAY     GOOD         1.4   59         11                3  1562
WINTER  NO HOLIDAY  WORKING DAY     GOOD         2.7   44         13                4  1600
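
The function clean.bike.data() is defined in utils.R, which is not shown here. If you do not have that file, the following hypothetical stand-in reproduces the columns displayed above (the recodings and rescalings follow the UCI data description; treat the details as assumptions about what the real function does):

clean.bike.data = function(d) {
  # Hypothetical stand-in for the version in utils.R -- an assumption;
  # the real function may differ in details.
  d$days_since_2011 = as.numeric(as.Date(d$dteday) - as.Date("2011-01-01"))
  d$season     = factor(d$season, levels = 1:4,
                        labels = c("WINTER", "SPRING", "SUMMER", "FALL"))
  d$holiday    = factor(d$holiday, levels = 0:1,
                        labels = c("NO HOLIDAY", "HOLIDAY"))
  d$workingday = factor(d$workingday, levels = 0:1,
                        labels = c("NO WORKING DAY", "WORKING DAY"))
  d$weathersit = factor(d$weathersit, levels = 1:3,
                        labels = c("GOOD", "MISTY", "RAIN/SNOW/STORM"))
  d$temp      = d$temp * (39 - (-8)) + (-8)  # de-normalize to degrees Celsius
  d$hum       = d$hum * 100                  # de-normalize to percent
  d$windspeed = d$windspeed * 67             # de-normalize (max wind speed 67)
  d
}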

Create a random subset of the data for training, which leaves a “hold-out data set” for testing.

set.seed(123)
nTrain = round(nrow(bikes)/2)  # half of the data for training
ranRows = sample(nrow(bikes), nTrain)

train = bikes[ranRows, ]
test = bikes[-ranRows, ]

3 Data Summaries

  1. (2 points) Using dplyr (group_by(), summarise()) and ggplot2, compute grouped summaries of the number of rented bikes and visualize them (one possible pattern is sketched below).
  2. (2 points) Repeat the above using the function lm() only.
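
To see why lm() alone can reproduce a dplyr group summary, note that a model with a single factor and no intercept estimates exactly the group means. A minimal sketch, assuming the grouping variable is season and the summary is the mean of cnt (both are assumptions for illustration):

library(dplyr)
library(ggplot2)

# group means the dplyr way (season and mean(cnt) are assumed for illustration)
season.means = train %>%
  group_by(season) %>%
  summarise(mean_cnt = mean(cnt))

ggplot(season.means, aes(x = season, y = mean_cnt)) +
  geom_col() +
  labs(y = "mean number of rented bikes")

# the same means via lm(): without an intercept (~ season - 1),
# each coefficient equals the corresponding group mean
coef(lm(cnt ~ season - 1, data = train))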

4 Standard Errors in Linear Regression

In this example, we use the linear regression model to predict the number of rented bikes on a particular day, given weather and calendar information. For the interpretation, we examine the estimated regression weights. The features are a mix of numerical and categorical variables. For each feature, the table shows the estimated weight, the standard error of the estimate (SE), and the absolute value of the t-statistic (|t|), i.e. the estimate divided by its standard error.


mod = lm(cnt ~ ., data = train, x = TRUE)  # x = TRUE stores the design matrix
lm_summary = summary(mod)$coefficients

lm_summary[, 't value'] = abs(lm_summary[, 't value'])  # report |t| instead of t
rownames(lm_summary) = pretty_rownames(rownames(lm_summary))  # pretty_rownames() from utils.R

kable(lm_summary[, c('Estimate', 'Std. Error', 't value')], digits = 1,
      col.names = c('Weight', 'SE', '|t|'))
                           Weight     SE   |t|
(Intercept)                  2192  319.1   6.9
seasonSPRING                 1182  166.0   7.1
seasonSUMMER                  629  219.0   2.9
seasonFALL                    534  153.2   3.5
holidayHOLIDAY               -544  337.5   1.6
workingdayWORKING DAY         179   97.9   1.8
weathersitMISTY              -410  115.3   3.6
weathersitRAIN/SNOW/STORM   -2148  302.8   7.1
temp                           84    9.3   9.0
hum                           -12    4.1   2.8
windspeed                     -42    9.4   4.5
days_since_2011                 5    0.2  21.1

The information in the weight table (weight and variance estimates) can be visualized in a weight plot. The following plot shows the results of the previous linear regression model.

coef_plot(mod) + scale_y_discrete("")  # coef_plot() is defined in utils.R
Weights are displayed as points and the 95% confidence intervals as lines.
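
The plotted intervals can also be reproduced without utils.R: for a linear model, the 95% interval is \(\hat\beta_j \pm t_{0.975,\,df} \cdot SE_j\), which confint() computes directly. A minimal check, using only objects defined above:

confint(mod, level = 0.95)  # 95% confidence intervals, as drawn in the plot

# equivalently by hand: estimate +/- t-quantile * SE
est = lm_summary[, "Estimate"]
se = lm_summary[, "Std. Error"]
cbind(lower = est - qt(0.975, mod$df.residual) * se,
      upper = est + qt(0.975, mod$df.residual) * se)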

4.1 Tasks

  1. (1 point) Predict the number of rented bikes for rows \(1\) and \(150\) of the test data:
season  holiday     workingday      weathersit  temp  hum  windspeed  days_since_2011
WINTER  NO HOLIDAY  NO WORKING DAY  MISTY        8.2   81         11                0
FALL    NO HOLIDAY  WORKING DAY     GOOD        14.4   64         28              292
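
One way to obtain these predictions is a sketch with predict(), assuming the two rows shown above are test[c(1, 150), ]:

predict(mod, newdata = test[c(1, 150), ])  # predicted counts for the two rows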
  2. (7 points) The “standard errors” of the coefficients quantify the uncertainty of the estimated coefficients. Write two functions:
  • simCoefs simulates the process of repeatedly (\(M = 500\)) fitting a linear model to random subsets of your data of size 366 and stores the estimated coefficients in a matrix,
  • visCoefs draws histograms of the obtained coefficient estimates, adds vertical lines at the corresponding estimates from lm_summary, and computes summary statistics. You can choose between base R graphics (example shown below) and ggplot.

Comment on the agreement with the estimated standard errors.

simCoefs = function(dat = bikes, nTrain, lm_summary, M = 500){
  # one possible implementation of the simulation described in the task
  coefs = matrix(NA, ncol = nrow(lm_summary), nrow = M)
  colnames(coefs) = rownames(lm_summary)
  for (i in 1:M) {
    # refit the model on a fresh random subset of size nTrain
    rows = sample(nrow(dat), nTrain)
    coefs[i, ] = coef(lm(cnt ~ ., data = dat[rows, ]))
  }
  return(coefs)
}


visCoefs = function(coefs, lm_summary){
  # one histogram per coefficient, with the lm() estimate as a vertical line
  for (j in 1:ncol(coefs)) {
    hist(coefs[, j], main = colnames(coefs)[j], xlab = "estimate")
    abline(v = lm_summary[j, "Estimate"], col = "red", lwd = 2)
  }
  m = colMeans(coefs)      # vector of means of the simulated coefficients
  s = apply(coefs, 2, sd)  # vector of standard deviations of the simulated coefficients
  return(list(mean = m, stdev = s))
}

coefs = simCoefs(bikes, nTrain,lm_summary,500)
ms = visCoefs(coefs, lm_summary)

                            mean  stdev
(Intercept)                 2385    238
seasonSPRING                 890    126
seasonSUMMER                 132    171
seasonFALL                   420    125
holidayHOLIDAY              -689    294
workingdayWORKING DAY        127     86
weathersitMISTY             -386     82
weathersitRAIN/SNOW/STORM  -1909    271
temp                       111.0    8.2
hum                          -17      3
windspeed                  -42.0    7.8
days_since_2011             4.94   0.22
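
To comment on the agreement, it helps to put the simulation-based standard deviations next to the theoretical standard errors from lm(), e.g.:

# simulated sd of each coefficient vs. the SE reported by lm()
cbind(sim.sd = ms$stdev, lm.SE = lm_summary[, "Std. Error"])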

5 Permutation Importance

This task introduces the concept of measuring the importance of features by randomly shuffling each column and measuring the drop in prediction accuracy.

The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is “unimportant” if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.

In particular:

  1. Record a baseline accuracy (classifier) or \(R^2\) score (regressor) by passing a validation set through the model.
  2. Permute the values of a single predictor column, pass the permuted validation set through the model again, and recompute the accuracy or \(R^2\).
  3. The importance of that feature is the difference between the baseline score and the score obtained on the permuted data.

  1. (4 points) Write a function which computes this permutation importance using the test set that we created earlier.
permImp = function(mod, dat = test){
  # one possible implementation; assumes the response cnt is the last column
  p = ncol(dat)
  impScores = rep(NA, p - 1)
  names(impScores) = colnames(dat)[1:(p - 1)]
  # hold-out R^2 as the score: 1 - SSE/SST
  r2 = function(pred, obs) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  baseline = r2(predict(mod, dat), dat$cnt)
  for (j in 1:(p - 1)) {
    datPerm = dat
    datPerm[, j] = sample(datPerm[, j])  # shuffle one feature column
    impScores[j] = baseline - r2(predict(mod, datPerm), dat$cnt)
  }
  return(impScores)
}

set.seed(321)
impScores=permImp(mod,test)
par(mar=c(4,8,4,2))
barplot(impScores,horiz = TRUE,col="darkblue",xlab="score", main ="permutation importance",las=2)
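
Because the scores depend on the random permutations (hence the set.seed() call above), it can also help to print them in decreasing order for a quick textual summary:

sort(impScores, decreasing = TRUE)  # most important features first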