For this coursework, you will solve the given problems using the MapReduce computational model and Mahout on a Hadoop cluster. This coursework carries 30% of the total mark for the module.
Q1) Using the hourly data across all weather stations, find:
· the daily maximum and minimum “Dry Bulb Temp” from all the weather stations
· the daily mean and variance of “Dry Bulb Temp” from all the weather stations
· the correlation between any successive days’ “Dry Bulb Temp” observations from all the weather stations
Do NOT use any package that computes these statistics for you. You MUST use the MapReduce framework. Write the pseudocode for the mapper and reducer functions for the above three tasks and implement them in Python. Note that when designing the mapper and reducer it is helpful to consider the following formulae for variance and correlation:
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the mean, $x_i$ is the $i$-th observation, and $N$ represents the number of observations.
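A mapper and reducer for the daily statistics might be sketched as follows. Because the variance equals the mean of the squares minus the square of the mean, a reducer only needs a count, a sum and a sum of squares per day, which it can accumulate in one pass. This is a minimal in-process sketch, not a reference solution: the comma-separated layout, the column positions `DATE_COL` and `TEMP_COL`, and the sample rows are all assumptions for illustration and must be adapted to the real data files. Under Hadoop Streaming the same logic would read from standard input and write tab-separated key/value lines.

```python
from collections import defaultdict

# Assumed column positions in a comma-separated record; adjust to the real file.
DATE_COL, TEMP_COL = 1, 3


def mapper(lines):
    """Emit (date, temperature) for every parsable hourly record."""
    for line in lines:
        fields = line.strip().split(",")
        try:
            yield fields[DATE_COL], float(fields[TEMP_COL])
        except (IndexError, ValueError):
            continue  # skip header rows, short rows and missing readings


def reducer(pairs):
    """Fold each day's values into count, sum, sum of squares, min and max."""
    acc = defaultdict(lambda: [0, 0.0, 0.0, float("inf"), float("-inf")])
    for date, x in pairs:
        a = acc[date]
        a[0] += 1
        a[1] += x
        a[2] += x * x
        a[3] = min(a[3], x)
        a[4] = max(a[4], x)
    for date in sorted(acc):
        n, s, ss, lo, hi = acc[date]
        mean = s / n
        # Population variance: mean of squares minus square of the mean.
        yield date, {"min": lo, "max": hi, "mean": mean, "var": ss / n - mean * mean}


# Made-up rows, purely to exercise the functions.
sample = [
    "station,date,time,dry_bulb_temp",
    "A,20070401,0100,10",
    "A,20070401,0200,14",
    "B,20070401,0100,12",
    "B,20070402,0100,M",
    "B,20070402,0200,20",
]
daily = dict(reducer(mapper(sample)))
```

The sum-of-squares form is convenient because partial results from several mappers or combiners can simply be added together, although it can lose precision when the mean is large relative to the spread.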
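Pearson Correlation = $\frac{N\sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i}{\sqrt{\left(N\sum_{i} x_i^2 - \left(\sum_{i} x_i\right)^2\right)\left(N\sum_{i} y_i^2 - \left(\sum_{i} y_i\right)^2\right)}}$, where $x_i$ and $y_i$ are the paired observations from the two successive days and $N$ is the number of pairs.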
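The correlation task can be approached the same way, since Pearson's coefficient depends only on the number of pairs and five sums (of x, y, x squared, y squared and xy). The sketch below assumes one particular reading of the task: an observation on day d is paired with the observation from the same station and hour on day d+1. That pairing rule, the `YYYYMMDD` date format, the record layout and the sample values are assumptions for illustration, not part of the brief.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from math import sqrt

FMT = "%Y%m%d"  # assumed date format


def shift(date, days):
    """Return the date string moved by the given number of days."""
    return (datetime.strptime(date, FMT) + timedelta(days=days)).strftime(FMT)


def mapper(records):
    """Send each reading to the two successive-day pairs it belongs to."""
    for station, date, hour, temp in records:
        yield (date, shift(date, 1)), (station, hour, "x", temp)   # first day of the pair
        yield (shift(date, -1), date), (station, hour, "y", temp)  # second day of the pair


def reducer(pairs):
    """Match x and y by (station, hour), then apply the sums-based formula."""
    groups = defaultdict(lambda: defaultdict(dict))
    for key, (station, hour, role, temp) in pairs:
        groups[key][(station, hour)][role] = temp
    for key in sorted(groups):
        matched = [(v["x"], v["y"]) for v in groups[key].values() if len(v) == 2]
        n = len(matched)
        sx = sum(x for x, _ in matched)
        sy = sum(y for _, y in matched)
        sxx = sum(x * x for x, _ in matched)
        syy = sum(y * y for _, y in matched)
        sxy = sum(x * y for x, y in matched)
        vx = n * sxx - sx * sx
        vy = n * syy - sy * sy
        if vx > 0 and vy > 0:  # skip pairs of days with no overlap or no spread
            yield key, (n * sxy - sx * sy) / sqrt(vx * vy)


# Made-up readings: the second day is the first day plus two degrees.
records = [
    ("A", "20070401", "01", 10.0), ("A", "20070401", "02", 20.0), ("B", "20070401", "01", 30.0),
    ("A", "20070402", "01", 12.0), ("A", "20070402", "02", 22.0), ("B", "20070402", "01", 32.0),
]
correlations = dict(reducer(mapper(records)))
```

In a real job the per-key sums could be accumulated incrementally rather than by collecting all matched pairs in memory, which is what makes the formula a good fit for a reducer.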
Q2) Cluster Analysis using Apache Mahout.
For this question, you can use the data (a set of text files that are placed in a folder) provided with Topic 4 for the k-means algorithm. You are welcome to use your own dataset for this question. If you choose to do so, please provide a link to the data in your report.
As we discussed in text clustering (Topic 4), the terms of the documents are considered as features in text clustering. The vector space model is an algebraic model that maps the terms in a document into n-dimensional linear space. However, we need to represent textual information (terms) as a numerical representation and create feature vectors using the numerical values to evaluate the similarity between data points.
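To make the vector space model concrete, the sketch below turns tokenised documents into sparse term-weight vectors and compares them with cosine similarity. It uses a plain tf × log(N/df) weighting, which is an assumption chosen for brevity and is not claimed to match the exact weighting Mahout applies; the three tiny documents are invented for illustration.

```python
from collections import Counter
from math import log, sqrt


def tfidf_vectors(docs):
    """Map each tokenised document to a sparse {term: weight} vector."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * log(n / df[t]) for t, tf in Counter(doc).items()} for doc in docs]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Invented documents, already split into terms.
docs = [
    ["hadoop", "cluster", "mapreduce"],
    ["hadoop", "mapreduce", "job"],
    ["weather", "station", "temperature"],
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

Documents that share terms get a similarity above zero, while documents with no terms in common get exactly zero. The cosine distance used for clustering is one minus this similarity.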
Use Apache Mahout and perform the standard steps for the cluster analysis: 1) create sequence files from the raw text, 2) create a sparse (efficient) representation of the vectors and initialise approximate centroids for K-Means, 3) run the K-Means algorithm, 4) get the final iteration’s clustering solution, and 5) evaluate the final solution.
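The steps above could be scripted roughly as follows. The helper only builds the command lines and does not run anything. The sub-commands (`seqdirectory`, `seq2sparse`, `kmeans`, `clusterdump`) and flags are written from memory of the Mahout command-line drivers and should be treated as assumptions: confirm them with `mahout <command> --help` on your cluster. The function name `mahout_pipeline` and all directory names are hypothetical.

```python
def mahout_pipeline(raw_dir, work_dir, k, max_iter=10,
                    measure="org.apache.mahout.common.distance.CosineDistanceMeasure"):
    """Return the four Mahout command lines for one K-Means run (nothing is executed)."""
    seq = f"{work_dir}/seqfiles"              # hypothetical output locations
    vec = f"{work_dir}/vectors"
    init = f"{work_dir}/initial-clusters-{k}"
    out = f"{work_dir}/kmeans-{k}"
    return [
        # 1) raw text files -> sequence files
        ["mahout", "seqdirectory", "-i", raw_dir, "-o", seq, "-c", "UTF-8"],
        # 2) sequence files -> sparse TF-IDF vectors (named vectors)
        ["mahout", "seq2sparse", "-i", seq, "-o", vec, "-wt", "tfidf", "-nv"],
        # 3) K-Means; passing -k asks Mahout to sample k initial centroids into `init`
        ["mahout", "kmeans", "-i", f"{vec}/tfidf-vectors", "-c", init, "-o", out,
         "-dm", measure, "-k", str(k), "-x", str(max_iter), "-ow", "-cl"],
        # 4) and 5) dump the final iteration's clusters with evaluation output
        ["mahout", "clusterdump", "-i", f"{out}/clusters-*-final",
         "-d", f"{vec}/dictionary.file-0", "-dt", "sequencefile",
         "-p", f"{out}/clusteredPoints", "-o", f"{work_dir}/dump-{k}.txt", "-e"],
    ]


commands = mahout_pipeline("docs", "work", k=5)
printable = [" ".join(cmd) for cmd in commands]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` once the flags have been verified, and calling the helper in a loop over several values of `k` gives one set of outputs per K.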
Further, you need to consider the following points in the analysis:
· Implement the K-Means clustering algorithm with cosine distance to cluster the instances into K clusters.
· Vary the value of K and comment on the precision.
· Plot a graph that shows the relation between the average distance to the centroid (or efficiency metric) and the K-value.
· Try to smooth the graph so that you can justify a best value for K, beyond which there is no significant reduction in the average distance to the centroid.
· Consider another distance measure of your choice and compare the clusters you obtain in the two cases. Discuss the best setting for K-Means clustering on this dataset.
You need to include the following in your coursework submission:
(a) For Q1, submit the pseudocode and Python code for the mapper and reducer implementations for all the descriptive statistics, with comments so that a layperson can follow. Anyone should be able to run your code and reproduce your results using the instructions you provide.
(b) For Q2, write a brief summary of the impact of parameter changes on the performance of the K-Means algorithm. For example, you may 1) compare different distance measures in the K-Means algorithm and discuss their merits and demerits, and 2) present a table that shows the performance of the K-Means algorithm for different K values.
(c) In Q2, if you automate the process of varying K then submit the code for the implementation along with some comments so that a layperson can follow. Note that there is no penalty for not doing the automation and you may manually repeat the experiments for different K values.
(d) Submit a report on the experiments. This report should be a detailed explanation (max. 1500 words, excluding code and references) of what you explored, the results you obtained, and some discussion points on the limitations of the MapReduce methodology and Hadoop’s MapReduce computing engine.
Credit will be given for:
· the depth and breadth of your investigation.
· the technical skills you demonstrate in your write-up.
· good use of the Hadoop cluster.
· critical evaluation of your work.