logo Hurry, Grab up to 30% discount on the entire course
Order Now logo

Ask This Question To Be Solved By Our ExpertsGet A+ Grade Solution Guaranteed

expert
Hiamnshu BholaData mining
(5/5)

502 Answers

Hire Me
expert
Neha SharmaOthers
(5/5)

791 Answers

Hire Me
expert
Dennison BertonBusiness
(5/5)

786 Answers

Hire Me
expert
Wyatt RyesComputer science
(5/5)

941 Answers

Hire Me
Software Engineering

Reading Invoice Data in SparkTo get started, create a spark cluster in the Databricks console.Once your cluster is up and running

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Part 1/3: Reading Invoice Data in Spark

1)   To get started, create a spark cluster in the Databricks console.  Once your cluster is up and running, take a screenshot and post it below. 

Step1: Creating a new spark cluster using Scala 2.11, Spark 2.4.3 and Python 3.0

 

Step2: Take a Screenshot of the running cluster

 

2)   Next, create a new notebook and execute the following code to print a sample of the

invoice dataset (provided by Databricks).  Again, provide a screenshot.

/dbfs/databricks-datasets/online_retail/data-001/data.csv

 

    with open("/dbfs/databricks-datasets/online_retail/data-001/data.csv") as f:

        x = ''.join(f.readlines())

    print(x)

Step3: Creating a new notebook

 

Step4: Displaying the built-in dataset inside of the python notebook

 

3)   Read the invoice CSV into a resilient distributed dataset (RDD).  Collect the first five rows and print them.  Take a screenshot of both the code and printed output and include it here.

 

Step5: Creating an RDD and displaying the first five rows of the dataset

Part 2/3: Answer the following questions regarding invoice data

 

For each question below, please:

  • Use map and reduce functions to answer the question.

  • Provide the snippet of Spark code that you used to answer the question.

  • Include a screenshot of your notebook that includes both the code and the printed

answer.

 

1)   Which customer in the dataset has spent the most on products?  The quantity

multiplied by the unit price will give you the total dollar amount spent per invoice line.

 

2)   What is the product description for the best-selling product in the dataset?  We will

define "Best Selling" as the product with the highest quantity sold.

 

3)   How much has each country spent on products?  The output should have two columns,

one being the country and the other being the gross dollar amount spent across all

products.  Sort the output by the dollar amount, descending.  Print the entire output,

showing a gross dollar amount for each country.

 

4)   What is the highest-grossing day in the dataset?  Again, use quantity multiplied by unit

price to get the revenue per line.

 

5)   Finally, try out one of Databrick's visualizations.  Note that you will need to convert

back to a DataFrame in order to visualize the data (hint: look at rdd.toDF()).  Create an

appropriate DataFrame for visualization and call display on it. 

Take a screenshot of your code and the resulting visualization.  You can find available

visualizations by expanding this icon at the bottom of a cell:

 

Part 3/3: Kafka Questions

1)   In one sentence or less, what is the purpose of Kafka.     

Kafka is an open source software which provides a framework for storing, reading and analyzing streaming data

 

2)   Describe two ways in which Kafka differentiates itself from other messaging

systems.

  1. Kafka is an active-active system. With replication configured it supports active-passive pattern for high availability. Most of the traditonal message brokers do not support active-active.

  2. Kafka scales well and scales horizontally, you can add more nodes to handle increasing load. I have seen systems scale upto billion msgs and 10s of TBs of data/day on few nodes of commodity hardware. JMS brokers in question can only scale vertically and will quickly hit gc limits 

 

3)   Describe one architectural decision that has contributed to Kafka’s scalability and

performance at scale.  Think specifically about how it interacts with the underlying

Operating System and/or Java Virtual Machine (JVM).

 

Kafkas Performance

Kafka relies heavily on the OS kernel to move data around quickly. It relies on the principals of Zero Copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end to end from Producer to file system (Kafka Topic Log) to the Consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes to the immutable commit log to the disk sequential; thus, avoids random disk access, slow disk seeking. Kafka provides horizontal Scale through sharding. It shards a Topic Log into hundreds potentially thousands of partitions to thousands of servers. This sharding allows Kafka to handle massive load.

 

Kafkas Scalability

Kafka is a good storage system for records/messages. Kafka acts like high-speed file system for commit log storage and replication. These characteristics make Kafka useful for all manners of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault-tolerance. Since modern drives are fast and quite large, this fits well and is very useful. Kafka Producers can wait on acknowledgment, so messages are durable as the producer write not complete until the message replicates. The Kafka disk structure scales well. Modern disk drives have very high throughput when writing in large streaming batches.

Related Questions

. The fundamental operations of create, read, update, and delete (CRUD) in either Python or Java

CS 340 Milestone One Guidelines and Rubric  Overview: For this assignment, you will implement the fundamental operations of create, read, update,

. Develop a program to emulate a purchase transaction at a retail store. This  program will have two classes, a LineItem class and a Transaction class

Retail Transaction Programming Project  Project Requirements:  Develop a program to emulate a purchase transaction at a retail store. This

. The following program contains five errors. Identify the errors and fix them

7COM1028   Secure Systems Programming   Referral Coursework: Secure

. Accepts the following from a user: Item Name Item Quantity Item Price Allows the user to create a file to store the sales receipt contents

Create a GUI program that:Accepts the following from a user:Item NameItem QuantityItem PriceAllows the user to create a file to store the sales receip

. The final project will encompass developing a web service using a software stack and implementing an industry-standard interface. Regardless of whether you choose to pursue application development goals as a pure developer or as a software engineer

CS 340 Final Project Guidelines and Rubric  Overview The final project will encompass developing a web service using a software stack and impleme