• Your solutions should be your own work and are to be submitted electronically to the course Moodle page by 12 noon on MONDAY, 26TH APRIL 2021.
• You can work either alone or in pairs for this assessment. It is up to you to form your own pairs. You MUST register your choices on Moodle by 12 noon on MONDAY, 29TH MARCH 2021, even if you choose to work alone.
• If you choose to work in a pair, you will be jointly responsible for the work that is submitted and you will be awarded the same mark.
• Ensure that you electronically ‘sign’ the plagiarism declaration on the Moodle page when submitting your work. If you choose to work in a pair, both of you should check what has been submitted before signing this declaration: if any plagiarism or collusion is identified with anyone outside your pair, you will share responsibility for it.
• Late submission will incur a penalty unless there are extenuating circumstances (e.g. medical) supported by appropriate documentation and notified within one week of the deadline above. Penalties, and the procedure in case of extenuating circumstances, are set out in the latest editions of the Statistical Science Department student handbooks, which are available from the departmental web pages.
• Failure to submit this in-course assessment will mean that your overall examination mark is recorded as “non-complete”, i.e. you will not obtain a pass for the course.
• Submitted work that exceeds the specified word count will be penalized. The penalties are described in the detailed instructions below.
• Your solutions should be your own work. When uploading your scripts, you will be required to electronically sign a statement confirming this, and that you have read the Statistical Science department’s guidelines on plagiarism and collusion (see below).
• Any plagiarism or collusion can lead to serious penalties for all students involved, and may also mean that your overall examination mark is recorded as non-complete. Guidelines as to what constitutes plagiarism may be found in the departmental student handbooks: the relevant extract is provided on the ‘In-course assessment 2’ tab on the STAT0023 Moodle page. The Turn-It-In plagiarism detection system may be used to scan your submission for evidence of plagiarism and collusion.
• You will receive feedback on your work via Moodle, and you will receive a provisional grade. Grades are provisional until confirmed by the Statistics Examiners’ Meeting in June 2021.
When the Covid-19 pandemic was first recognised in early 2020, it quickly became apparent that age was the main risk factor for becoming seriously ill or dying from the disease. Researchers have also identified other risk factors including gender, social deprivation, pre-existing health conditions and ethnicity.[1] Understanding these risk factors can potentially help to develop strategies for reducing deaths, for example by targeting appropriate healthcare resources in areas that need them the most.[2]
In the UK, the Office for National Statistics (ONS) publishes a variety of information on Covid. An ONS report from August 2020[3] produced a simple analysis of Covid death rates across England and Wales, between March and July 2020. In this assessment we will examine more closely the data used in that report and try to understand why some areas have more deaths than others, by linking to UK Census data on the socio-economic characteristics of the different areas.
We will use data consisting of the total numbers of reported deaths in the period March–July 2020, where Covid-19 was given as the cause of death, for each of 7201 “Middle Layer Super Output Areas” (MSOAs) in England and Wales. According to the ONS report cited above, Super Output Areas are “small-area statistical geographies covering England and Wales”, each of which has a similarly sized population and remains stable over time. These data are from the ONS web site.[4] They have been combined with demographic and socioeconomic data from the most recent UK Census in 2011, obtained by querying datasets at the Nomis Labour Market Statistics service; and also with some geographic information from the UK’s Open Geography Portal.
The data are provided in the file UKCovidWave1.csv, available from the ‘In-course assessment 2’ tab of the STAT0023 Moodle page. This contains an anonymised version of the original data. Full details, including the anonymisation procedure (which includes rounding of most variables) can be found in the Appendix to these instructions. The first 5 401 rows are complete, i.e., contain all values of the death count and covariates. The last 1 800 rows contain all values of the covariates, but -1 for the death counts.
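As a minimal R sketch of the file layout described above, the -1 placeholder might be recoded to a genuine missing value and the two blocks of rows separated as follows. The column name `deaths` and the toy data frame are assumptions made purely for illustration; the actual variable names are those given in the Appendix, and with the real file the data frame would come from `read.csv("UKCovidWave1.csv")` instead.

```r
# Toy stand-in for the real file; 'deaths' is a hypothetical column name.
# With the real data: covid <- read.csv("UKCovidWave1.csv")
covid <- data.frame(
  ID     = 1:6,
  deaths = c(3, 0, 7, 2, -1, -1)
)

# Recode the placeholder -1 as a genuine missing value
covid$deaths[covid$deaths == -1] <- NA

# Rows with an observed count are available for model building;
# rows with a missing count are the ones to be predicted later
train        <- covid[!is.na(covid$deaths), ]
predict_rows <- covid[is.na(covid$deaths), ]

nrow(train)
nrow(predict_rows)
```

Recoding to `NA` before any analysis means that summaries and model fits cannot silently treat -1 as a real count.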
Your task in this assessment is to use the data from the first 5 401 records, to build a statistical model that will help you to:
• Understand the social, demographic and economic factors associated with variation between MSOAs in numbers of Covid deaths during the period March–July 2020; and
• Estimate the numbers of deaths for each of the 1 800 records where you don’t have this information.
1 See, for example, Williamson et al. (2020): “Factors associated with COVID-19-related death using OpenSAFELY” (Nature 584, pp. 430–436).
2 For a more general overview of the key role that statistics has to play in responding to crises, see the Royal Statistical Society’s Ten recommendations on better use of stats and data in a pandemic, released on 8th March 2021.
3 ONS Statistical Bulletin “Deaths involving COVID-19 by local area and socioeconomic deprivation: deaths occurring between 1 March and 31 July 2020”, published August 2020.
4 Here and elsewhere, clicking on the blue text will take you to the relevant web site.
You may use either R or SAS for this assessment.
1. Read the data into your chosen software package and carry out any necessary recoding (e.g. to deal with the fact that -1 represents a missing value).
2. Carry out an exploratory analysis that will help you to start building a sensible statistical model to understand and predict the numbers of Covid deaths in each MSOA. This analysis should aim to identify an appropriate set of candidate variables to take into the subsequent modelling exercise, as well as to identify any important features of the data that may have some implications for the modelling. You will need to consider the context of the problem to guide your choice of exploratory analysis. See the ‘Hints’ below for some ideas.
3. Using your exploratory analysis as a starting point, develop a statistical model that enables you to predict the number of Covid deaths for each MSOA based on (a subset of) the other variables in the dataset, and also to understand the variation in deaths between different MSOAs. To be convincing, you will need to consider a range of models and to use an appropriate suite of diagnostics to assess them. Ultimately however, you are required to recommend a single model that is suitable for interpretation, and to justify your recommendation. Your chosen model should be either a linear model, a generalized linear model or a generalized additive model.
4. Use your chosen model to predict the number of Covid deaths for each MSOA where this information is missing, and also to estimate the standard deviation of your prediction errors.
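One possible way to approach step 4 is sketched below in R, on simulated data. Everything here is an assumption for illustration: the variable names `deaths` and `x` are hypothetical, and the Poisson GLM is only a placeholder, not a recommended model. The sketch splits the prediction error as (Y - mu) + (mu - mu_hat), treats the two parts as independent, and so approximates its variance by Var(Y) + Var(mu_hat). Under a Poisson assumption Var(Y) = mu; a different family would need a different variance term.

```r
set.seed(1)

# Simulated toy data; 'deaths' and 'x' are hypothetical variable names
n   <- 200
toy <- data.frame(ID = 1:n, x = runif(n))
toy$deaths <- rpois(n, lambda = exp(0.5 + 1.2 * toy$x))
toy$deaths[151:200] <- NA   # pretend the last 50 counts are missing

train   <- toy[!is.na(toy$deaths), ]
newdata <- toy[is.na(toy$deaths), ]

# A Poisson GLM with log link, purely as a placeholder model
fit <- glm(deaths ~ x, family = poisson(link = "log"), data = train)

# Predicted means and their standard errors on the response scale
pr <- predict(fit, newdata = newdata, type = "response", se.fit = TRUE)

# Approximate prediction-error variance: Var(Y) + Var(mu_hat),
# with Var(Y) estimated by the fitted mean (Poisson assumption)
pred_mean <- pr$fit
pred_sd   <- sqrt(pr$fit + pr$se.fit^2)

head(data.frame(ID = newdata$ID, pred_mean, pred_sd))
```

The standard error returned by `predict()` reflects only uncertainty in the estimated mean, so using it alone would understate the spread of the prediction errors.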
Submission for this assessment is electronic, via the STAT0023 Moodle page. You are required to submit three files, as follows:
• A report on your analysis, not exceeding 2 500 words of text plus two pages of graphs and / or tables. The word count includes titles, footnotes, appendices, references etc.: in fact it includes everything except the two pages of graphs / tables and, if present, the separate page describing the contribution of each pair member (see below). Your report should be in three sections, as follows:
Section I: Describe briefly what aspects of the problem context you considered at the outset, how you used these to start your exploratory analysis, and what were the important points to emerge from this exploratory analysis.
Section II: Describe briefly (without too many technical details) what models you considered in step (3) above, and why you chose the model that you did.
Section III: State your final model clearly, summarise what your model tells you about the factors associated with variation of death counts in each MSOA, and discuss any potential limitations of the model.
Your report should not include any computer code. It should include some graphs and / or tables, but only those that support your main points. Graphs and tables must appear on separate pages, or they will be included in the word count.
In addition to your data analysis, if you are working as a pair then you must include an additional page at the end of your report where each pair member briefly describes their contribution to the project. You will need to agree this in your pairs before submitting the report. If both pair members agree that they contributed equally then it is sufficient to write a single sentence to that effect, or alternatively you are very welcome to describe your own personal contribution to the project. Note that this page will not be marked and does not contribute to the word count; nor will different marks be allocated to different pair members based on this. The purpose is to encourage you all to be mindful about contributing to this piece of group-work.
Your report should be submitted as a PDF file named as ########_rpt.pdf, where ######## is your group ID, with any spaces replaced by underscores (IMPORTANT!!!). For example, if your group ID is ‘ICA2Group C’, your report should be named ICA2Group_C_rpt.pdf.
• An R script or SAS program corresponding to your analysis and predictions. Your script / program should run without user intervention on any computer with R or SAS installed, providing the file UKCovidWave1.csv is present in the current working directory / current folder. When run, it should produce any results that are mentioned in your report, together with the predictions and the associated standard deviations. The script / program should be named ########.r or ########.sas as appropriate, where ######## is your group ID with underscores instead of spaces. For example, if your group ID is ‘ICA2Group C’ and you use R, your script should be named ICA2Group_C.r.
You may not create any additional input files that can be referenced by your script; nor should you write any code that requires access to the internet in order to run it. If you use R however, you may use the following additional libraries if you wish (together with other libraries that are loaded automatically by these): mgcv, ggplot2, grDevices, RColorBrewer, lattice and MASS. You may not use any other add-on libraries: for present purposes, an “add-on library” is one that requires a library() or require() command or equivalent (e.g. the package::command syntax) before it can be used, if your R system is installed using default settings.
• A text file containing your predictions for the 1 800 observations with missing counts. This file should be named ########_pred.dat, where ######## is your group ID with underscores instead of spaces. The file should contain three columns, separated by spaces and with no header. The first column should be the record identifier (corresponding to variable ID in file UKCovidWave1.csv); the second should be the corresponding count prediction, and the third should be the standard deviation of your prediction error.
• NOTE: if you work in pairs, both members of a pair must confirm their submission on Moodle before the submission deadline.
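The prediction file format described above could be produced in R along the following lines. This is a sketch only: the group ID is the example used in these instructions, and the record identifiers, predictions and standard deviations are made-up toy values, as are the column names `pred_mean` and `pred_sd`.

```r
# Example group ID from the instructions; toy values for illustration only
group_id <- "ICA2Group C"
preds <- data.frame(
  ID        = c(5402, 5403, 5404),
  pred_mean = c(2.31, 0.87, 5.10),
  pred_sd   = c(1.55, 0.95, 2.30)
)

# Replace spaces in the group ID with underscores to build the file name
out_file <- paste0(gsub(" ", "_", group_id), "_pred.dat")

# Three space-separated columns, with no header, row names or quotes
write.table(preds, file = out_file, sep = " ",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back as a quick check of its layout
check <- read.table(out_file, header = FALSE)
dim(check)
```

Reading the file back is a cheap way to confirm that it has exactly three columns and one row per record before it is submitted.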
Marking criteria
There are 75 marks for this exercise. These are broken down as follows:
• Report: 40 marks. The marks here are for: displaying awareness of the context for the problem and using this to inform the statistical analysis; good judgement in the choice of exploratory analysis and in the model-building process; a clear and well-justified argument; clear conclusions that are supported by the analysis; and appropriate choice and presentation of graphs and / or tables. The mark breakdown is as follows:
– Awareness of context: 5 marks.
– Exploratory analysis: 10 marks. These marks are for (a) tackling the problem in a sensible way that is justified by the context (b) carrying out analyses that are designed to inform the subsequent modelling.
– Model-building: 10 marks. The marks are for (a) starting in a sensible place that is justified from the exploratory analysis (b) appropriate use of model output and diagnostics to identify potential areas for improvement (c) awareness of different modelling options and their advantages and disadvantages (d) consideration of the social, economic and demographic context during the model-building process.
– Quality of argument: 5 marks. The marks are for assembling a coherent ‘narrative’, for example by drawing together the results of the exploratory analysis so as to provide a clear starting point for model development, presenting the model-building exercise in a structured and systematic way and, at each stage, linking the development to what has gone before.
– Clarity and validity of conclusions: 5 marks. These marks are for stating clearly what you have learned about how and why the numbers of deaths vary between MSOAs, and for ensuring that this is supported by your analysis and modelling.
– Graphs and / or tables: 5 marks. Graphs and / or tables need to be relevant, clear and well presented (for example, with appropriate choices of symbols, line types, captions, axis labels and so forth). There is a one-slide guide to ‘Using graphics effectively’ in the slides / handouts for the Week 1 videos for the course. Note that you will only receive credit for the graphs in your report if your submitted script / program generates and automatically saves all of these graphs when it is run.
Note that you will be penalised if your report exceeds EITHER the specified 2 500-word limit or the number of pages of graphs and / or tables. Following UCL guidelines, the maximum penalty is 7 marks, and no penalty will be imposed that takes the final mark below 30/75 if it was originally higher. Subject to these conditions, penalties are as follows:
– More than two pages of graphs and / or tables: zero marks for graphs and / or tables, in the marking scheme given above.
– Exceeding the word count by 10% or less: mark reduced by 4.
– Exceeding the word count by more than 10%: mark reduced by 7.
In the event of disagreement between reported word counts on different software systems, the count used will be that from the examiner’s system. The examiners will use an R function called PDFcount to obtain the word count in your PDF report: this function is available from the Moodle page in file PDFcount.r.
• Coding: 15 marks. There are 3 marks here for reading the data, preprocessing and setting up variable names correctly and efficiently; 7 marks for effective use of your chosen software in the exploratory analysis and modelling (e.g. programming efficiently and correctly); and 5 marks for clarity of your code — commenting, layout, choice of variable / object names and so forth.
• Prediction quality: 20 marks. The remaining 20 marks are for the quality of your predictions. Note, however, that you will only receive credit for your predictions if your submitted ########_pred.dat file is identical to that produced by your script / program when it is run: if this is not the case, your predictions will earn zero marks.
For these marks, you are competing against each other. Your predictions will be assessed using the following score: