Learning outcomes of this assessment
The learning outcomes covered by this assignment are:
• Provide a broad overview of the general field of ‘big data systems’.
• Develop specialised knowledge in areas that demonstrate the interaction and synergy between ongoing research and practical deployment in this field of study.
Key skills to be assessed
This assignment aims to assess your skills in:
• The use of common big data tools and techniques
• Your ability to implement a standard data analysis process
  – Loading the data
  – Cleansing the data
  – Analysis
  – Visualisation / Reporting
• The use of Python, SQL and Linux terminal commands
Recommended Reading
The module notes, complemented by tools and techniques covered in other modules, are sufficient literature for completing this assignment successfully.
For reference documentation:
• Spark documentation (https://spark.apache.org/documentation.html)
• Hive documentation (https://cwiki.apache.org/confluence/display/Hive/Home)
• Impala documentation (https://www.cloudera.com/documentation/enterprise/latest/topics/impala.html)
• Sqoop documentation (https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html)
• MySQL documentation (https://dev.mysql.com/doc/refman/5.5/en/)
• Python documentation (https://developers.google.com/edu/python/introduction and https://matplotlib.org/users/intro.html)
Equipment and Facilities to be Used
For this assignment, the Cloudera Virtual Machine provided for this module must be used. All processing must be done via scripts and code, and these must be stored and included in the submission. Terminal commands must be stored in shell scripts, and language-specific code must be stored in separate files (for example, SQL code must be stored in SQL scripts and Python code must be stored in Python scripts). The solution has to be implemented using both SQL and Python.
Workload
For the successful completion of this assignment, a total of 80 hours should be budgeted.
Task
You will be given a dataset and a set of problem statements. Where possible, you are required to implement the solution in both SQL (using either Hive or Impala) and Spark (using pyspark or spark-shell). If you do not supply both solutions, you will need to explain your reasons carefully.
General instructions
You will follow a typical data analysis process:
For steps 1, 2 and 3 you will use the virtual machine (and the software installed on it) that has been provided as part of this module. The data necessary for this assignment will be provided as a MySQL dump, which you will need to copy onto the virtual machine and work with from there.
The virtual machine has a MySQL server running and you will need to load the data into the MySQL server. From there you will be required to use Sqoop to get the data into Hadoop.
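The MySQL-to-Hadoop transfer described above can be scripted. The following is a minimal sketch, assuming Python 3.8 or later and a set of hypothetical names (database `football`, table `tweets`, user `cloudera`, and the HDFS paths shown); none of these come from the assignment. It only assembles and prints a `sqoop import` command line; actually running it requires Sqoop and the loaded MySQL database on the virtual machine.

```python
import shlex
import subprocess


def build_sqoop_import(database, table, username, password_file, target_dir,
                       host="localhost", mappers=1):
    """Assemble a `sqoop import` command for one MySQL table.

    All argument values used in this sketch are hypothetical placeholders.
    """
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", username,
        "--password-file", password_file,  # keeps the password off the command line
        "--table", table,
        "--target-dir", target_dir,        # HDFS directory that receives the rows
        "--num-mappers", str(mappers),
    ]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical database, table, user and paths, for illustration only.
    run(build_sqoop_import("football", "tweets", "cloudera",
                           "/user/cloudera/.mysql-password",
                           "/user/cloudera/tweets_raw"))
```

Keeping the command as a list of arguments avoids shell quoting problems, and the dry-run default makes it easy to inspect the command before any data is moved.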
For the cleansing, preparation and analysis you will implement the solution twice (where possible): first in SQL using either Hive or Impala, and then in Spark using either pyspark or spark-shell.
For the visualisation of the results you are free to use any tool that fulfils the requirements, which can be tools you have learned about such as Python’s matplotlib, SAS or Qlik, or any other free open source tool you may find suitable.
Extra features to be implemented
To get more than a “Satisfactory” mark, a number of extra features should be implemented. Features include, but are not limited to:
• Creation of a single script that executes the entire process, from loading the supplied data to exporting the result data required for visualisation.
• Implementing the Spark solution in Scala as opposed to Python.
• Use of parametrised scripts that allow you to pass parameters to the queries to dynamically set data selection criteria, for instance passing datetime parameters to select the tweets in that time period.
• Plotting of extra graphs that visualise useful information discovered through your own exploration and not covered by the other problem statements.
• Extraction of statistical information from the data.
• The use of file formats other than plain text.
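One way to approach the parametrised-script feature is a small wrapper that validates datetime arguments and forwards them to Hive as variables. This is a minimal sketch, not a prescribed solution: the script name `tweets_in_window.sql` and the variable names `start_ts` and `end_ts` are hypothetical, and it assumes the SQL script reads them as `'${hivevar:start_ts}'` and `'${hivevar:end_ts}'`.

```python
import argparse
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_ts(text):
    """Validate a timestamp argument and return it as a datetime."""
    try:
        return datetime.strptime(text, TS_FORMAT)
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected a timestamp like 2000-01-31 18:30:00, got {text!r}")


def build_hive_command(argv=None):
    """Turn --start/--end arguments into a `hive` command line.

    With argv=None, argparse reads the arguments from sys.argv.
    """
    parser = argparse.ArgumentParser(description="Run a time-windowed query")
    parser.add_argument("--start", type=parse_ts, required=True)
    parser.add_argument("--end", type=parse_ts, required=True)
    # Hypothetical SQL script that uses ${hivevar:start_ts} and ${hivevar:end_ts}.
    parser.add_argument("--script", default="tweets_in_window.sql")
    args = parser.parse_args(argv)
    if args.end <= args.start:
        parser.error("--end must be later than --start")
    return [
        "hive",
        "--hivevar", f"start_ts={args.start.strftime(TS_FORMAT)}",
        "--hivevar", f"end_ts={args.end.strftime(TS_FORMAT)}",
        "-f", args.script,
    ]
```

In a real script, the returned list could be passed to `subprocess.run(cmd, check=True)`. Validating the timestamps in Python first means a malformed parameter fails before the query is started.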
The data
You will be given a dataset containing simplified Twitter data pertaining to a number of football games. The dataset will be supplied in compressed format and will be made available online for download or can be supplied by USB memory stick. Further information regarding each game, including the teams playing and their official hashtags, start and end times, as well as the times of any goals, will also be provided.
Problem statements
You are a data analyst / data scientist working for an event security company that monitors real-time events to analyse the level of potential disturbance. To assess commotion at an event, the company monitors the Twitter feeds pertaining to it. They would like answers to the following questions (in all of them, you should consider half time and overtime as ‘during-game’).
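The ‘during-game’ convention can be illustrated with a short sketch. It assumes each game has a kick-off time and a final-whistle time, and that tweets outside that window are labelled ‘pre-game’ or ‘post-game’; those two labels, the function names and the inclusive boundaries are assumptions made for illustration, not part of the brief.

```python
from collections import Counter
from datetime import datetime


def game_phase(tweet_time, kick_off, final_whistle):
    """Return 'pre-game', 'during-game' or 'post-game' for one tweet timestamp.

    Half time and overtime lie between kick-off and the final whistle,
    so they count as 'during-game' without any special handling.
    """
    if tweet_time < kick_off:
        return "pre-game"
    if tweet_time <= final_whistle:
        return "during-game"
    return "post-game"


def count_by_phase(tweet_times, kick_off, final_whistle):
    """Count tweets per phase for a single game."""
    return Counter(game_phase(t, kick_off, final_whistle) for t in tweet_times)
```

The same comparison translates directly into a `CASE WHEN` expression in Hive or Impala, or into a column expression in Spark.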
Questions / problem statements:
Report
A 4000-5000 word report that documents your solution.
Additional advice to the client can earn marks above the “Satisfactory” grade. This could include, but is not limited to:
• Other findings based on your analysis of the data
• An outline of algorithms that would extract further information from the data
• A discussion of alternative visualisations that could prove useful
Along with the report, you are also expected to fill in a self-assessment form.
Requirements / Marking Scheme
Requirement                          Assessment Method        Weight (%)
Data load and preparation            Report & Demonstration   20
Data analysis                        Report & Demonstration   30
Report                               Report                   30
Demonstration of the work            Demonstration            10
Satisfactory response to questions   Demonstration            10
Notes
• The assignment must be completed on your own.
• The assignment must be completed on time. If you submit work late, it will be marked according to the University’s late submission policy.
Unfair means
The University has strict policies on unfair means. It is your responsibility to ensure that you both understand these and adhere to them in the production of your assignment. Any submitted work in which unfair means are identified will be penalised in accordance with the University regulations.
Submission
Your submission should be uploaded as a single ZIP file. The file should be named as follows:
<<Your surname>> <<Your name>>.zip – for example: Smith John.zip
All items in the zip file should also be prefixed with your surname. (Ensure you replace “Smith” with your surname in the names below.)
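The naming rules above can be automated when packaging the submission. The following is a minimal sketch: the function name and the underscore separator between the surname and the original file name are assumptions, so check them against the required item names before relying on it.

```python
import zipfile
from pathlib import Path


def package_submission(surname, first_name, files, out_dir="."):
    """Write '<surname> <first_name>.zip', prefixing every member with the surname."""
    archive = Path(out_dir) / f"{surname} {first_name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in map(Path, files):
            name = path.name
            if not name.startswith(surname):
                # The underscore separator is an assumption, not part of the brief.
                name = f"{surname}_{name}"
            zf.write(path, arcname=name)
    return archive
```

For example, `package_submission("Smith", "John", ["load_data.sh", "analysis.sql"])` would create `Smith John.zip` in the current directory, provided both files exist.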
The following items must be included in your submission:
It is assumed that you will also address any social / legal and ethical issues surrounding the implementation of the project such as copyright, references, licenses, and web law.
Demonstration
You will need to demonstrate your working scripts and be prepared to discuss functionality and implementation. Demonstrations will be held in a room and at a time to be arranged after the submission deadline, most likely in week 12.
Assessment Criteria
The following assessment criteria are provided as a guide to the criteria that you need to satisfy in order to get a grade within each of the following ranges.
Extremely poor (0-9)
• Totally inadequate demonstration of required knowledge.
• Not able to apply the practical and analytical skills from their programmes.
• No appropriate design methodology.
• No demonstration of analysis evaluation or synthesis.
• No evidence of the ability to self-manage a significant piece of work and critical self-evaluation of the process.
• Little academic value; presentation is extremely poor; work has no structure or clarity; extremely poor use of language; no references; no attempt to provide evidence of sources used.
Very Poor (10-19)
• Virtually no relevant knowledge demonstrated.
• Fails to adequately apply the practical and analytical skills from their programme.
• Very poor use of design methodology.
• No meaningful analysis or evaluation or synthesis.
• Unable to self-manage a significant piece of work and to identify appropriate issues for critical self-evaluation of the process for reflection.
• Academic arguments presented are inappropriate or very poorly linked; presentation is very poor; work has little discernible structure or clarity; very poor use of language; lack of ability to source adequate material; very poor referencing.
Poor (20-29)
• Inconsistent or inaccurate knowledge.
• Limited, inappropriate and inaccurate application of the practical and analytical skills from their programme.
• Poor use of methodology.
• Descriptive, with occasional attempts to analyse or evaluate material, but lacks a critical approach to evaluation or synthesis.
• Identifies issues for reflection but lacks evidence of reflective processes.
• Some but inconsistent ability to self-manage a significant piece of work or critical self-evaluation of the process.
• Confusion or weakness in academic argument; presentation is poor; work is disorganised and lacks clarity; poor use of language; poor use of reference material; inappropriate or outdated sources with numerous referencing errors.
Inadequate (30-39)
• Limited evidence of knowledge.
• Inappropriate application of the practical and analytical skills from their programme.
• Unsatisfactory design methodology.
• Mainly descriptive evidence of analysis, inconsistent critical approach, little evaluation or synthesis.
• Follows processes of reflection but fails to demonstrate insight; lacks coherence in the self-management of a significant piece of work.
• Presentation is unsatisfactory; work is limited in terms of structure, coherence or clarity; limitations in academic style; unsatisfactory referencing with errors; limited ability to support content with relevant sources.
Unsatisfactory (40-49)
• Basic knowledge with occasional inaccuracies.
• Appropriate yet basic application of the practical and analytical skills from their programme.
• Superficial depth or limited breadth, but an overall adequate identification of design methodology.
• Critical analysis evident, with some evaluation and synthesis, although limited evidence of reflection.
• Some evidence of an ability to self-manage a significant piece of work and critical self-evaluation of the process.
• Some appropriate academic argument although not well applied and lacking in clarity; presentation of work is adequate in terms of structure, coherence, clarity and academic style; some inconsistencies; some grammar and syntax errors which detract from the content; narrow range of sources; referencing in presented work is adequate with some inconsistencies or inaccuracies; over utilises secondary sources; references used are inappropriate in terms of currency.
Satisfactory (50-59)
• Mostly accurate knowledge with satisfactory depth and breadth of knowledge.
• Solid application of the practical and analytical skills from their programme.
• Fair use of design methodology.
• Sound critical analysis and evaluation or synthesis.
• Demonstrates basic ability to synthesise information in order to formulate appropriate questions and conclusions; reflective process is utilised, with insight demonstrating planning for future practice; shows the ability to self-manage a significant piece of work and critical self-evaluation of the process.
• Relevant academic argument; presentation of work is fair in terms of structure, coherence, clarity and academic style; some inconsistencies in grammar and syntax; fair range of sources identified with appropriate referencing and few inaccuracies; appropriate use of primary and secondary sources.
Good (60-69)
• Consistently relevant accurate knowledge with good depth and breadth.
• Clear and relevant application of the practical and analytical skills from their programme.
• Good use of design methodology.
• Clear, in depth critical analysis, evaluation and academic argument with synthesis of different ideas and perspectives.
• Utilises reflection to develop self and practice; aware of the influence of varied perspectives and time frames; demonstrates an ability to self-manage a significant piece of work and critical self-evaluation of the process.
Very Good (70-79)
• Comprehensive knowledge demonstrating very good depth and breadth.
• Clear insight into links between the practical and analytical skills from their programme.
• Strong use of design methodology.
• Very good analysis and synthesis of material with evidence of critical and independent thought.
• Demonstrates ability to transfer knowledge between different contexts appropriately; balanced and mature approach to reflection used to enhance practice and performance; clear ability to self-manage a significant piece of work and critical self-evaluation of the process.
• Presentation is of a very good standard, demonstrating a scholarly style. Very good grammar and syntax. Clear evidence of referencing to a wide range of primary and secondary sources which are used effectively in supporting the work.
Excellent (80-89)
• Excellent depth of knowledge in a variety of contexts.
• Coherent and systematic application of the practical and analytical skills from their programme.
• Excellent use of design methodology.
• Excellent critical analysis and synthesis.
• Integrates the complexity of a range of knowledge and excellent understanding of its relevance; confident in their ability to self-manage a significant piece of work and critical self-evaluation of the process.
• Arguments handled skilfully with imaginative interpretation of material; presentation is excellent, well-structured and logical; demonstrates a scholarly style; excellent grammar and syntax.
Outstanding (90-100)
• Outstanding knowledge.
• Exceptional application of the practical and analytical skills from their programme.
• Excellent professional execution of design methodology.
• Outstanding critical analysis and synthesis.
• Excels in self-managing a significant piece of work and in critical self-evaluation of the process; shows an aptitude to formulate new questions, ideas or challenges.