We consider the 1992 baseball salary data set, , which is available from
This data set (of dimension 337 × 18) contains salary information and performance measures for 337 Major League Baseball players in 1992. More detailed information can be found at
The data set contains the following variables.
Table 1: Variable Description for the 1992 Baseball Salary Data
Var | Columns | Description
salary | 1 – 4 | Salary (in thousands of dollars)
X1 | 6 – 10 | Batting average
X2 | 12 – 16 | On-base percentage (OBP)
X3 | 18 – 20 | Number of runs
X4 | 22 – 24 | Number of hits
X5 | 26 – 27 | Number of doubles
X6 | 29 – 30 | Number of triples
X7 | 32 – 33 | Number of home runs
X8 | 35 – 37 | Number of runs batted in (RBI)
X9 | 39 – 41 | Number of walks
X10 | 43 – 45 | Number of strike-outs
X11 | 47 – 48 | Number of stolen bases
X12 | 50 – 51 | Number of errors
X13 | 53 | Indicator of “free agency eligibility”
X14 | 55 | Indicator of “free agent in 1991/2”
X15 | 57 | Indicator of “arbitration eligibility”
X16 | 59 | Indicator of “arbitration in 1991/2”
ID | 61 – 79 | Player’s name (in quotation marks)
The data set can be input into R by reading directly from the website, with the following R commands:
baseball <- read.table(file = "", header = FALSE,
    col.names = c("salary", "x1", "x2", "x3", "x4", "x5",
                  "x6", "x7", "x8", "x9", "x10", "x11", "x12",
                  "x13", "x14", "x15", "x16", "ID"))
# Work with log(salary) as the response
baseball$logsalary <- log(baseball$salary)
# Remove salary (column 1) and ID (column 18)
baseball <- baseball[, -c(1, 18)]
head(baseball)
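Because the data URL is not reproduced here, the same steps can be tried on a small stand-in. The sketch below is only an illustration: the two rows are made up in the layout of Table 1 (they are not real player records), and `sample_lines`, `cols`, and `demo` are hypothetical names. It shows that `read.table` keeps a quoted name containing spaces as a single field, and it drops `salary` and `ID` by name rather than by position.

```r
# Two MADE-UP rows in the documented layout (assumption: NOT real records).
sample_lines <- c(
  '1000 0.250 0.320  50 100 20  2 10  45  40  60 10  5 1 0 0 0 "Hypothetical Player A"',
  ' 200 0.300 0.380  70 140 25  4 15  60  55  70 20  8 0 0 1 0 "Hypothetical Player B"'
)

# Same column names as in the assignment: salary, x1..x16, ID.
cols <- c("salary", paste0("x", 1:16), "ID")

# The quoted player name is read as one field despite the embedded spaces.
demo <- read.table(text = sample_lines, header = FALSE, col.names = cols)

# Add the log response, then drop salary and ID by name.
demo$logsalary <- log(demo$salary)
demo <- demo[, !(names(demo) %in% c("salary", "ID"))]

str(demo)
```

Dropping columns by name gives the same result as `-c(1, 18)` here, but it does not depend on the column order staying fixed.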
Complete the project by following the specific instructions given below.
1. Perform EDA on the data (excluding x13 − x16) and describe any interesting observations.
2. Starting with the whole model that includes all predictors (i.e., X1, X2, . . . , X16), apply one model selection procedure of your choice to select your best model. Provide the fitting results from your ‘best’ model, i.e., the table of Parameter Estimates and the ANOVA table.
3. Suppose that we are interested in the following model:
Model I: log(salary) = β0 + β1x13 + β2x15 + β3x3 + β4x4 + ε. (1)
Fit Model I in (1) and output the two tables: Table of Parameter Estimates and the ANOVA table.
4. Use BIC to compare the best model that you found with Model I in (1). Which one is better according to BIC? (Hint: You may use the R function BIC.)
5. Perform a test of H0 : β3 = β4 = 0 in Model I.
6. Given a player who has 80 runs (X3 = 80) and 120 hits (X4 = 120) in total and is eligible for arbitration (X15 = 1), but not free agency yet (X13 = 0), provide a 95% prediction interval for his salary based on Model I.
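The R calls involved in items 3 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not a solution: the data frame `sim` is simulated purely so the code runs on its own (only the column names match the assignment), `reduced` stands in for whatever model you select in item 2, and all numbers it prints are meaningless for the real data. Replace `sim` with `baseball` once the data are loaded.

```r
# SYNTHETIC stand-in for the real data (assumption): column names match the
# assignment, but every value is simulated, not a real player record.
set.seed(2024)
n <- 337
sim <- data.frame(
  x3  = rpois(n, 45),        # runs
  x13 = rbinom(n, 1, 0.4),   # free agency eligibility indicator
  x15 = rbinom(n, 1, 0.2)    # arbitration eligibility indicator
)
sim$x4 <- 2 * sim$x3 + rpois(n, 10)   # hits, made to correlate with runs
sim$logsalary <- 5 + 1.5 * sim$x13 + 1.2 * sim$x15 +
  0.01 * sim$x3 + 0.005 * sim$x4 + rnorm(n, sd = 0.5)

# Item 3: fit Model I and print the two tables.
model1 <- lm(logsalary ~ x13 + x15 + x3 + x4, data = sim)
print(summary(model1)$coefficients)   # table of parameter estimates
print(anova(model1))                  # sequential (Type I) ANOVA table

# Item 5: H0: beta3 = beta4 = 0, via a partial F-test on nested models.
reduced <- lm(logsalary ~ x13 + x15, data = sim)
ftest <- anova(reduced, model1)
print(ftest)

# Item 4: BIC comparison (smaller is better). Here `reduced` is only a
# placeholder for the model chosen in item 2.
bic_table <- BIC(reduced, model1)
print(bic_table)

# Item 6: 95% prediction interval on the log scale, then back-transformed.
new_player <- data.frame(x13 = 0, x15 = 1, x3 = 80, x4 = 120)
pi_log <- predict(model1, newdata = new_player,
                  interval = "prediction", level = 0.95)
pi_salary <- exp(pi_log)   # interval endpoints in thousands of dollars
print(pi_salary)
```

Since the response is log(salary), exponentiating the two endpoints gives a valid 95% prediction interval for salary itself. The exponentiated point prediction, however, estimates the median salary rather than the mean.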
Some helpful tips for computer projects are listed below:
• Start early and don’t wait till the last day/minute;
• Create a table for all output presented. Use copy-and-paste appropriately to include necessary R output in your final report;
• Remember to interpret every result that you present;
• Place your R code in an appendix.
• Submit your report and code in OneNote.