This assignment is worth 70% of the total marks available for this module. If coursework is submitted late (and where there are no extenuating circumstances):
1 - If the assessment is submitted no later than 24 hours after the deadline, the mark for the assessment will be capped at the minimum pass mark;
2 - If the assessment is submitted more than 24 hours after the deadline, a mark of 0 will be given for the assessment.
Your submission must include the official Coursework Submission Cover sheet, which can be found here:
Your coursework should be submitted via Learning Central by the above deadline. You have to upload the following files:
Description | Type | Name
Cover sheet (compulsory) | One PDF (.pdf) file | Student_number.pdf
Your solution to Part 1 and Part 2 (compulsory) | One Jupyter Notebook (.ipynb) file | Part1_2.ipynb
HTML version of Part1_2.ipynb (compulsory) | One HTML (.html) file | Part1_2.html
Your solution to Part 3 (compulsory) | One Jupyter Notebook (.ipynb) file | Part3.ipynb
HTML version of Part3.ipynb (compulsory) | One HTML (.html) file | Part3.html
For the filename of the Cover Sheet, replace ‘Student_number’ by your student number, e.g. “C1234567890.pdf”. Make sure to include your student number as a comment in all of the Python files! Any deviation from the submission instructions (including the number and types of files submitted) may result in a reduction of marks for the assessment or question part.
You can submit multiple times on Learning Central. ONLY files contained in the last attempt will be marked, so make sure that you upload all files in the last attempt.
Staff reserve the right to invite students to a meeting to discuss the Coursework submissions.
Start by downloading the following files from the Assessment tab in Learning Central:
• Part1_2.ipynb (Jupyter Notebook file for Part 1 and Part 2)
• Part3.ipynb (Jupyter Notebook file for Part 3)
• listings.csv
• reviews.csv
Then answer the following questions. You can use any Python expression or package that was used in the lectures and practical sessions. Additional packages are not allowed unless instructed in the question. You answer the questions by filling in the appropriate sections in the Jupyter Notebook. Export your final Jupyter Notebooks as HTML to produce the corresponding HTML files. Before submitting your Jupyter Notebooks and HTML files, make sure to restart the kernel and execute each cell such that all outputs and figures are visible.
In this assignment you slip into the role of a Data Scientist who has been hired by Airbnb. Airbnb is an online marketplace for vacation and short-term rentals of rooms or flats which operates in many countries around the world (see https://en.wikipedia.org/wiki/Airbnb). Airbnb collects data on the listings and on the users interacting with the platform. Let us define these terms first:
- User: A user is someone using the Airbnb platform (guest or host).
- Guest: A guest is a user who uses the Airbnb platform to book a room, flat or house.
- Host: A host is a user who offers a room, flat, or house for rent.
- Listing: A listing is a room, flat, or house offered for rent. In the dataset, each row corresponds to a listing.
Since we do not have access to Airbnb’s internal database, we will instead use data published by the website Inside Airbnb. The data has been acquired by web-scraping publicly available data from Airbnb. It is available online (http://insideairbnb.com/about.html) but you do not need to download any data from this website since we have downloaded and renamed all datasets you need and made them available on Learning Central.
In our scenario, you are responsible for the Amsterdam branch of Airbnb operations. Your main task is to provide insights on the data collected in Amsterdam as well as write algorithms to improve the experience of Airbnb users. The assignment is split into three parts. In the first two parts, you will focus on the numerical parts of the data. In the last part, you will focus on the text data.
• Part 1: Pre-processing and exploratory analysis
You start by reading the csv file into a Pandas DataFrame, cleaning the data and removing unwanted columns, performing conversions and calculating new columns. Then, you will perform exploratory analysis to look at the properties and distribution of the data, and answer a couple of questions your manager put forward to you.
• Part 2: Statistical analysis and recommender systems
Starting from the pre-processed DataFrame, you will perform statistical analysis using t-tests and linear regression to identify variables that significantly affect the rental price. Then, you will design a series of recommender systems that have been requested by users: a function that helps in setting the price for someone offering a new property, and a function that helps in selecting a city to visit given a particular budget.
• Part 3: Text analysis and ethics
You will mostly work with unstructured text data in a Pandas DataFrame representing user reviews.
To get you started with the dataset, you can have a look at a corresponding Jupyter Notebook (note: following and working through this link is not required for this assignment, but you may find some inspiration or useful information there). You are allowed to re-use code from the Jupyter Notebook provided that you properly reference it.
Part 1 – Pre-processing and exploratory analysis [25 marks]
You answer this question by filling in the first part of the Part1_2.ipynb Jupyter Notebook.
Question 1a – Drop columns (Total 4 marks)
When reading in the dataframe using the load_csv function, one can see that it contains a lot of textual data which will not be relevant for the numerical analyses in Part 1 and Part 2. Therefore, implement two functions drop_cols and drop_cols_na which remove some of the columns. Detailed instructions: [2 marks each]
• drop_cols(df): takes the dataframe as an input. It returns the reduced dataframe after dropping the following columns:
'scrape_id', 'last_scraped', 'description', 'listing_url', 'neighbourhood', 'calendar_last_scraped', 'amenities', 'neighborhood_overview', 'picture_url', 'host_url', 'host_about', 'host_location', 'host_total_listings_count', 'host_thumbnail_url', 'host_picture_url', 'host_verifications', 'bathrooms_text', 'has_availability', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'number_of_reviews_l30d', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms'
• drop_cols_na(df, threshold): drops columns according to the fraction of NaN values they contain. threshold is a fraction between 0 and 1. If the fraction of NaNs in a column is equal to or larger than the threshold, the respective column is dropped. For instance, if threshold is 0.5, all columns that have 50% or more NaNs are dropped. The default value for threshold is 0.5.
To solve the question, complete the functions in Question 1a in the notebook.
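The threshold rule for drop_cols_na can be illustrated on a small toy DataFrame. The following is only a minimal sketch of one possible approach, not a reference solution; the toy columns 'a', 'b' and 'c' are made up and do not come from listings.csv.

```python
import numpy as np
import pandas as pd


def drop_cols_na(df, threshold=0.5):
    """Drop every column whose fraction of NaN values is >= threshold."""
    nan_fraction = df.isna().mean()  # per-column fraction of NaNs
    keep = nan_fraction[nan_fraction < threshold].index
    return df[keep]


# Toy data (hypothetical columns, for illustration only)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # 0% NaN
    "b": [np.nan, np.nan, 3.0, 4.0],  # 50% NaN: dropped at threshold 0.5
    "c": [np.nan, 2.0, 3.0, 4.0],     # 25% NaN: kept at threshold 0.5
})

reduced = drop_cols_na(toy)
print(list(reduced.columns))
```

Because the comparison is "equal to or larger than", a column with exactly 50% NaNs is removed at the default threshold.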
Question 1b – Recode and add columns (Total 6 marks)
We continue pre-processing by recoding some of the columns and adding columns with additional information. To this end, implement the following functions: binary_encoding, add_host_days, convert_price.
Detailed instructions: [2 marks each]
• binary_encoding(df): if we have a close look at some of the columns, we can see that some of them are binary variables representing 1 (True) and 0 (False) values. However, the values are encoded as the strings ‘t’ (for True) and ‘f’ (for False). Recode these columns by turning them into the integer numbers 0 and 1.
• add_host_days(df): it would be useful to have a column that represents the number of days (with respect to the current date) that the host has been registered. To this end, create a new column called 'host_days' (hint: look into Pandas’ to_datetime method).
• convert_price(df): the ‘price’ column represents the nightly price in USD. However, the prices are encoded as strings. This function should convert the prices into floating point numbers. For instance, the string ‘$40’ should be converted to the floating point number 40.0.
To solve the question, complete the functions in Question 1b in the notebook.
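The three recoding steps can be tried out on a toy DataFrame first. The sketch below is one possible approach under several assumptions: the registration date is assumed to live in a column called 'host_since', prices are assumed to possibly contain a thousands separator (e.g. '$1,250.00'), the toy column 'host_is_superhost' is used purely for illustration, and the optional today argument is a hypothetical addition that only makes the example reproducible.

```python
import pandas as pd


def binary_encoding(df):
    """Recode columns that only contain 't'/'f' strings as integers 1/0."""
    df = df.copy()
    for col in df.columns:
        values = set(df[col].dropna().unique())
        if values and values <= {"t", "f"}:
            # Note: if the column contains NaNs, the result is float, not int
            df[col] = df[col].map({"t": 1, "f": 0})
    return df


def convert_price(df):
    """Turn price strings such as '$40' or '$1,250.00' into floats."""
    df = df.copy()
    df["price"] = (
        df["price"]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)  # assumed thousands separator
        .astype(float)
    )
    return df


def add_host_days(df, today=None):
    """Add 'host_days': days between 'host_since' (assumed name) and today."""
    df = df.copy()
    today = pd.Timestamp.today().normalize() if today is None else pd.Timestamp(today)
    df["host_days"] = (today - pd.to_datetime(df["host_since"])).dt.days
    return df


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "t"],
    "price": ["$40.00", "$1,250.00", "$99.50"],
    "host_since": ["2020-01-01", "2019-12-31", "2020-01-10"],
})

out = add_host_days(convert_price(binary_encoding(toy)), today="2020-01-11")
print(out)
```

Each function works on a copy, so the original DataFrame stays unchanged and the steps can be re-run in any order while experimenting.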
Question 1c – Answering questions (Total 6 marks)
Your manager has a couple of questions about the dataset, and provided you with a list of questions they want answered. The questions are as follows: [1 mark each]
• How many hosts offer two or more properties for rent?
• What is the highest price for a listing?
• What is the ID of the listing that has the largest number of bedrooms?
• What is the ID of the listing with the largest advertised price?
• There are different room types. How many listings are there for the most common room type?
• How many hosts are there that have been registered for more than 3000 days?
To solve the questions, provide corresponding Pandas code in Question 1c in the notebook.
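Questions of this kind usually map onto a handful of Pandas idioms (value_counts, max, idxmax, nunique). The sketch below shows them on made-up toy data; the column names 'id' and 'host_id' are assumptions about the dataset, and 'host_days' is the column created in Question 1b.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "id": [101, 102, 103, 104, 105],
    "host_id": [1, 1, 2, 3, 3],
    "price": [80.0, 120.0, 300.0, 95.0, 60.0],
    "bedrooms": [1, 2, 5, 1, 3],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room"],
    "host_days": [3500, 3500, 1200, 3100, 3100],
})

# Hosts with two or more listings: count rows per host, then count the hosts
n_multi_hosts = int((toy["host_id"].value_counts() >= 2).sum())

# Highest price, and the IDs of the rows holding a column maximum
max_price = toy["price"].max()
id_most_bedrooms = toy.loc[toy["bedrooms"].idxmax(), "id"]
id_max_price = toy.loc[toy["price"].idxmax(), "id"]

# value_counts sorts in descending order, so the first entry is the mode
n_most_common = int(toy["room_type"].value_counts().iloc[0])

# Count hosts (not listings) registered for more than 3000 days
n_old_hosts = toy.loc[toy["host_days"] > 3000, "host_id"].nunique()

print(n_multi_hosts, max_price, id_most_bedrooms, id_max_price,
      n_most_common, n_old_hosts)
```

Note the difference between counting listings and counting hosts: since each row is a listing, host-level questions need value_counts or nunique on the host identifier.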
Question 1d – Exploratory analyses (Total 9 marks)
To further explore the dataset, you produce a number of exploratory plots. In particular, you set out to produce the following three plots: [3 marks each]
- Plot 1: A barplot with horizontal bars representing average nightly prices. Produce a bar for each neighbourhood (use the neighbourhood_cleansed column).
- Plot 2: The review ratings have 5 additional subitems: cleanliness, checkin, communication, location, value. You are interested in the correlations between the subitems. To this end, produce a correlation matrix that depicts all pair-wise Pearson correlations between these 5 variables.
- Plot 3: Your manager is interested in the geographical distribution of nightly prices for the more expensive listings. To this end, produce a scatterplot using latitude/longitude as coordinates. The price should be encoded both by color and size of the circles. Make sure to include only those listings with a price larger than $150.
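Before plotting, it can help to prepare the data behind each of the three plots. The sketch below does this on toy data and leaves the actual plotting calls out; the review column names ('review_scores_cleanliness' etc.) are assumptions about the dataset, and only three of the five subitems are shown to keep the example short.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Centrum", "Centrum", "Noord", "Noord"],
    "price": [200.0, 100.0, 160.0, 80.0],
    "latitude": [52.37, 52.38, 52.40, 52.41],
    "longitude": [4.89, 4.90, 4.92, 4.93],
    "review_scores_cleanliness": [4.5, 4.0, 5.0, 3.5],
    "review_scores_location": [4.8, 4.6, 4.9, 4.1],
    "review_scores_value": [4.0, 4.2, 4.8, 3.9],
})

# Plot 1: average nightly price per neighbourhood (sorted for readability)
mean_price = toy.groupby("neighbourhood_cleansed")["price"].mean().sort_values()

# Plot 2: pairwise Pearson correlations between the review subitems
subitems = ["review_scores_cleanliness", "review_scores_location",
            "review_scores_value"]
corr = toy[subitems].corr(method="pearson")

# Plot 3: keep only listings with a price strictly larger than $150
expensive = toy[toy["price"] > 150]

print(mean_price)
print(corr.round(2))
print(expensive[["latitude", "longitude", "price"]])
```

From here, mean_price can be drawn as horizontal bars (e.g. mean_price.plot.barh()), corr can be shown as an image or heatmap, and expensive can be passed to a scatterplot with price mapped to both the color and size arguments.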
The following figures indicate what such plots might look like. Note that titles, labels, etc. are missing; the figures are intended for orientation only. Your solution can have a different style, colors, etc.
Part 2 – Statistical analysis and recommender systems [35 marks]
You answer this question by filling in the second part of the Part1_2.ipynb Jupyter Notebook.
Question 2a – Linear regression and t-tests (Total 8 marks)
In this question you will perform a linear regression and a number of t-tests. Detailed instructions: [4 marks each]
- Linear regression: Review scores consist of an overall rating as well as additional scores on subitems (accuracy, cleanliness, checkin, communication, location, value). Can the overall rating be predicted from the scores on the subitems? To investigate this, perform a linear regression on rating using statsmodels. After fitting the model, print the summary. Make sure that the variable names shown in the summary are short and readable (accuracy, cleanliness, checkin, communication, location, value).
- T-tests: You want to investigate whether the difference in prices for different room types is significant. To investigate this, perform independent samples t-tests for the prices of each combination of room types using alpha = 0.01. Which room types are significantly different in terms of price? Do your conclusions change if you perform Bonferroni correction to the alpha level (see https://en.wikipedia.org/wiki/Bonferroni_correction)? Finally, create a DataFrame with 4 rows and 4 columns that holds the p-values for all pairwise combinations. The column/row names should be the different room types, and each entry in the DataFrame corresponds to the p-value for the respective combination of room types.
To solve this question, provide corresponding code in Question 2a in the notebook and provide short answers in the space designated as YOUR ANSWER.
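The pairwise t-tests and the Bonferroni correction can be sketched as follows. This is a minimal illustration on simulated toy data with three made-up room types ('A', 'B', 'C'); it assumes scipy.stats.ttest_ind for the independent samples t-test and leaves the diagonal of the p-value table as NaN, which is one of several reasonable choices.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Simulated toy data (hypothetical room types and prices)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "room_type": np.repeat(["A", "B", "C"], 30),
    "price": np.concatenate([
        rng.normal(100, 10, 30),  # A
        rng.normal(100, 10, 30),  # B: same mean as A
        rng.normal(150, 10, 30),  # C: clearly more expensive
    ]),
})

alpha = 0.01
room_types = sorted(toy["room_type"].unique())
pairs = list(combinations(room_types, 2))

# Bonferroni: divide alpha by the number of tests performed
alpha_bonf = alpha / len(pairs)

# Symmetric table of p-values, indexed by room type
pvals = pd.DataFrame(np.nan, index=room_types, columns=room_types)
for a, b in pairs:
    prices_a = toy.loc[toy["room_type"] == a, "price"]
    prices_b = toy.loc[toy["room_type"] == b, "price"]
    result = stats.ttest_ind(prices_a, prices_b)
    pvals.loc[a, b] = pvals.loc[b, a] = result.pvalue

print(pvals)
print("significant at alpha:      ", [p for p in pairs if pvals.loc[p] < alpha])
print("significant after Bonferroni:", [p for p in pairs if pvals.loc[p] < alpha_bonf])
```

With four room types there are six pairwise tests, so the Bonferroni-corrected level would be 0.01 / 6 rather than 0.01 / 3 as in this toy example.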
Question 2b – Linear regression with variable selection (Total 15 marks)
In this question you investigate how well nightly price can be predicted from the other variables in the dataframe. You need to decide yourself which variables to choose, but make sure you have at least 10 variables. The only requirement is that you use room_type as a predictor. Because room_type is a categorical variable, you first have to use dummy coding to turn it into a number of binary variables (hint: pd.get_dummies()). In the notebook, provide a short explanation for your choice of variables.
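Dummy coding with pd.get_dummies can be sketched on a toy example as follows. The use of drop_first=True is an assumption made here: it drops one category as the reference level so that the dummies are not perfectly collinear with the regression intercept. The prefix 'room' is likewise just an illustrative choice.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "bedrooms": [2, 1, 1, 1],
})

# One binary column per category, minus the first (reference) category
dummies = pd.get_dummies(toy["room_type"], prefix="room", drop_first=True, dtype=int)

# Replace the categorical column by its dummy columns
encoded = pd.concat([toy.drop(columns="room_type"), dummies], axis=1)
print(encoded)
```

The resulting column names contain spaces, so renaming them to short identifiers may be convenient if they are later used in a regression summary.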
Starting from the variables you have chosen, our goal is to derive a sparse model with fewer variables. This process is called variable selection. In variable selection (‘variable’ means the same as ‘predictor’), variables get iteratively added or removed from the regression model.
Once finished, the model typically contains only a subset of the original variables. It makes it easier to interpret the model, and in some cases it makes it generalise better to new data. To perform variable selection, implement a function variable_selection(df, predictors, target, alpha)
where df is the listings dataframe, predictors is a list with your initial selection of at least 10 variables (e.g. [‘bedrooms’, ‘beds’, …]), target is the target variable for the regression (e.g. ‘price’), and alpha is the significance level for selecting significant predictors (e.g. 0.05). The function returns pred, the selected subset of the original predictors.
To calculate regression fits and p-values you can use statsmodels. Your approach operates in two stages. In stage 1, you build a model by adding variables one after the other. You keep adding variables as long as they increase the adjusted R² coefficient. You do not need to calculate it by hand; it is provided by the statsmodels package. In stage 2, starting from these variables, if any of them are not significant, you keep removing variables until all variables in the model are significant. The output of the second stage is your final set of variables. Let us look at the two stages in detail.