source("utils.R")
This assignment serves two purposes. First, it will train your modeling and data manipulation skills, in particular linear regression and summarizing data the dplyr way. Second, it reinforces your computational statistics (or: understanding statistics by simulation) competencies by comparing theoretical standard errors to those obtained by simulation. And as a bonus, you get to explore a cool data set and learn about feature importance.
You should work in groups (ideally about three students per group). Each group must submit at least one R file containing well-documented functions together with test cases for those functions. You may use two files (one for the functions, one for the test cases), but this is not necessary. Write your answers and explanations as comments in the file. We strongly encourage you to submit an Rmd file (plus its compiled version) instead of a plain R file.
This home assignment is worth 16 points (= 20%) and is due Sunday, July 11, 6 PM.
You need to download the Bike Sharing Dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset), read in the day.csv data, and preprocess it as follows.
bikes <- read.csv("Bike-Sharing-Dataset/day.csv", stringsAsFactors = FALSE)
#bikes$days_since_2011 = as.numeric(as.Date(bikes$dteday)-as.Date("2011-01-01"))
bike.features.of.interest = c("season","holiday","workingday", "weathersit","temp", "hum", "windspeed", "days_since_2011", "cnt") # colnames(bike)[c(1,4,6,7,8,9,10,12)]
bikes = clean.bike.data(bikes)[,bike.features.of.interest]
options(digits=2)
#datatable(bikes[1:50,c(bike.features.of.interest, "cnt")])
kable(bikes[1:5,])
| season | holiday | workingday | weathersit | temp | hum | windspeed | days_since_2011 | cnt |
|---|---|---|---|---|---|---|---|---|
| WINTER | NO HOLIDAY | NO WORKING DAY | MISTY | 8.2 | 81 | 11 | 0 | 985 |
| WINTER | NO HOLIDAY | NO WORKING DAY | MISTY | 9.1 | 70 | 17 | 1 | 801 |
| WINTER | NO HOLIDAY | WORKING DAY | GOOD | 1.2 | 44 | 17 | 2 | 1349 |
| WINTER | NO HOLIDAY | WORKING DAY | GOOD | 1.4 | 59 | 11 | 3 | 1562 |
| WINTER | NO HOLIDAY | WORKING DAY | GOOD | 2.7 | 44 | 13 | 4 | 1600 |
Create a random subset of the data for training, which leaves a hold-out data set for testing:
set.seed(123)
nTrain=round(nrow(bikes)/2)
ranRows = sample(nrow(bikes),nTrain)
train = bikes[ranRows, ]
test = bikes[-ranRows, ]
Explore the training data with dplyr (group_by(), summarise()) and ggplot2: summarise the rental counts by weathersit, by workingday, and by weathersit and workingday jointly. Then fit the regression model using lm() only.

In this example, we use the linear regression model to predict the number of rented bikes on a particular day, given weather and calendar information. For the interpretation, we examine the estimated regression weights. The features consist of numerical and categorical features. For each feature, the table shows the estimated weight, the standard error of the estimate (SE), and the absolute value of the t-statistic (|t|).
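As a warm-up for the dplyr part, here is a minimal sketch of the group_by()/summarise() pattern. Since the preprocessed bike data are not embedded in this sheet, the sketch uses a toy data frame whose column names are assumed to match the assignment; the cnt values are invented for illustration only.

```r
library(dplyr)  # for group_by() and summarise()

# Toy stand-in for the preprocessed bike data (values are made up)
toy <- data.frame(
  weathersit = c("GOOD", "MISTY", "GOOD", "RAIN/SNOW/STORM", "MISTY", "GOOD"),
  workingday = c("WORKING DAY", "NO WORKING DAY", "WORKING DAY",
                 "WORKING DAY", "NO WORKING DAY", "NO WORKING DAY"),
  cnt = c(4500, 3200, 5100, 1800, 2900, 3800)
)

# Mean rental count per weather situation and working-day status
sumTab <- toy %>%
  group_by(weathersit, workingday) %>%
  summarise(meanCnt = mean(cnt), n = n(), .groups = "drop")
sumTab
```

A bar chart of `sumTab` (e.g. ggplot2 with `geom_col()` and `facet_wrap(~workingday)`) then visualises the same grouping.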
#data(bike)
mod = lm(cnt ~ ., data = train, x = TRUE)
lm_summary = summary(mod)$coefficients
lm_summary[,'t value'] = abs(lm_summary[,'t value'])
rownames(lm_summary) = pretty_rownames(rownames(lm_summary))
kable(lm_summary[,c('Estimate', 'Std. Error', 't value')], digits = 1, col.names = c('Weight', 'SE', "|t|"))
|   | Weight | SE | \|t\| |
|---|---|---|---|
| (Intercept) | 2192 | 319.1 | 6.9 |
| seasonSPRING | 1182 | 166.0 | 7.1 |
| seasonSUMMER | 629 | 219.0 | 2.9 |
| seasonFALL | 534 | 153.2 | 3.5 |
| holidayHOLIDAY | -544 | 337.5 | 1.6 |
| workingdayWORKING DAY | 179 | 97.9 | 1.8 |
| weathersitMISTY | -410 | 115.3 | 3.6 |
| weathersitRAIN/SNOW/STORM | -2148 | 302.8 | 7.1 |
| temp | 84 | 9.3 | 9.0 |
| hum | -12 | 4.1 | 2.8 |
| windspeed | -42 | 9.4 | 4.5 |
| days_since_2011 | 5 | 0.2 | 21.1 |
The information in the weight table (the weight and variance estimates) can be visualized in a weight plot. The following plot shows the results of the linear regression model fitted above.
coef_plot(mod) + scale_y_discrete("")
Weights are displayed as points and the 95% confidence intervals as lines.
| season | holiday | workingday | weathersit | temp | hum | windspeed | days_since_2011 |
|---|---|---|---|---|---|---|---|
| WINTER | NO HOLIDAY | NO WORKING DAY | MISTY | 8.2 | 81 | 11 | 0 |
| FALL | NO HOLIDAY | WORKING DAY | GOOD | 14.4 | 64 | 28 | 292 |
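For rows like the two above, point predictions come from `predict()` on the fitted model. Since the bike data are not reproduced here, this sketch uses the built-in mtcars data as a stand-in; the same call applies to `mod` with `newdata = test`.

```r
# Sketch: fit on a random training half, predict on the held-out rows
# (mtcars stands in for the bike data; mpg ~ wt + hp stands in for cnt ~ .)
set.seed(123)
trainIdx <- sample(nrow(mtcars), nrow(mtcars) / 2)
fit   <- lm(mpg ~ wt + hp, data = mtcars[trainIdx, ])
preds <- predict(fit, newdata = mtcars[-trainIdx, ])
head(preds)
```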
Complete the functions below, then comment on how well the simulated standard deviations agree with the standard errors estimated by lm().
simCoefs = function(dat=bikes, nTrain, lm_summary, M=500){
  coefs = matrix(NA, ncol = nrow(lm_summary), nrow = M)
  colnames(coefs) = rownames(lm_summary)
  # TODO: for each of the M replicates, sample nTrain rows from dat,
  # refit the linear model, and store its coefficients in row m of coefs
  return(coefs)
}
visCoefs = function(coefs, lm_summary){
  m = NA # TODO: vector of means of the simulated coefficients
  s = NA # TODO: vector of standard deviations of the simulated coefficients
  return(list(mean = m, stdev = s))
}
coefs = simCoefs(bikes, nTrain,lm_summary,500)
ms = visCoefs(coefs, lm_summary)
|   | (Intercept) | seasonSPRING | seasonSUMMER | seasonFALL | holidayHOLIDAY | workingdayWORKING DAY | weathersitMISTY | weathersitRAIN/SNOW/STORM | temp | hum | windspeed | days_since_2011 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 2385 | 890 | 132 | 420 | -689 | 127 | -386 | -1909 | 111.0 | -17 | -42.0 | 4.94 |
| sd | 238 | 126 | 171 | 125 | 294 | 86 | 82 | 271 | 8.2 | 3 | 7.8 | 0.22 |
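One way to flesh out the bodies of simCoefs()/visCoefs() is to refit the model on repeated random subsets and summarise the resulting coefficient draws. Here is a sketch of that resampling loop on the built-in mtcars data (mpg ~ wt + hp stands in for the bike model; for the assignment the same loop would draw nTrain rows from `bikes` and fit `cnt ~ .`).

```r
# Sketch of the resampling idea behind simCoefs()/visCoefs(), on mtcars
set.seed(1)
M <- 200
n <- nrow(mtcars)
coefMat <- matrix(NA, nrow = M, ncol = 3)
colnames(coefMat) <- c("(Intercept)", "wt", "hp")
for (m in 1:M) {
  idx <- sample(n, n / 2)                       # random training half
  coefMat[m, ] <- coef(lm(mpg ~ wt + hp, data = mtcars[idx, ]))
}
simMean <- colMeans(coefMat)       # should track the full-data estimates
simSD   <- apply(coefMat, 2, sd)   # compare to the SEs reported by summary(lm(...))
```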
This task will introduce you to the concept of measuring feature importance by randomly shuffling each column and measuring the drop in prediction accuracy.
The concept is really straightforward: We measure the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. A feature is “important” if shuffling its values increases the model error, because in this case the model relied on the feature for the prediction. A feature is “unimportant” if shuffling its values leaves the model error unchanged, because in this case the model ignored the feature for the prediction.
In particular:
1. Record a baseline accuracy (classifier) or \(R^2\) score (regressor) by passing a validation set through the model.
2. Permute the values of a single predictor column and pass all test samples back through the model, recomputing the accuracy or \(R^2\).
3. The importance of that feature is the drop in score: the baseline minus the accuracy or \(R^2\) obtained with the permuted column.
permImp = function(mod, dat=test){
  p = ncol(dat)
  impScores = rep(NA, p-1)
  names(impScores) = colnames(dat)[1:(p-1)] # all features except cnt (last column)
  # TODO: compute the baseline R^2 on dat, then for each feature permute its
  # column, re-predict with mod, and store the resulting drop in impScores
  return(impScores)
}
set.seed(321)
impScores=permImp(mod,test)
par(mar=c(4,8,4,2))
barplot(impScores,horiz = TRUE,col="darkblue",xlab="score", main ="permutation importance",las=2)
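The three steps above can be sketched end to end. This example again uses the built-in mtcars data and the drop in \(R^2\); the bike model `mod` and the `test` set are assumed to plug into `permImp()` in the same way.

```r
# Sketch of permutation importance via drop in R^2 (mtcars as stand-in data)
set.seed(321)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
r2  <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
baseline <- r2(mtcars$mpg, predict(fit, mtcars))

feats <- c("wt", "hp", "disp")
imp <- sapply(feats, function(f) {
  shuffled <- mtcars
  shuffled[[f]] <- sample(shuffled[[f]])       # permute one predictor column
  baseline - r2(mtcars$mpg, predict(fit, shuffled))
})
imp  # positive values = the model relied on that feature
```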