Python Programming

Write your own code for splitting the document into tokens

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

The following nomenclature is to be used for this assignment:

1. Report text available against each student -> document or instance

2. Words in the report text -> tokens

3. Splitting a sentence or document into constituent words -> tokenization

4. Stop-words -> commonly used words; meaning could still be retained without them

5. White spaces, commas and full stops act as token splitter delimiters.

M1. Load the documents for all the 37 students. Tokenize all the documents and store the tokens. You may either:

o use the NLTK package for tokenization (Suggestion: Python users -> import nltk; from nltk.tokenize import word_tokenize), or

o write your own code for splitting the document into tokens (words).
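For the "write your own code" option, a minimal sketch could split on the delimiters listed in the nomenclature (white space, commas and full stops). The function name `tokenize` and the lower-casing step are assumptions for illustration; the assignment does not say how letter case should be handled.

```python
import re

# Delimiters named in the nomenclature: white space, commas and full stops.
_DELIMITERS = re.compile(r"[\s,.]+")


def tokenize(text):
    """Split text into tokens on runs of white space, commas and full stops."""
    # Lower-casing is an assumption, so that "The" and "the" become one token.
    pieces = _DELIMITERS.split(text.lower())
    # Splitting can leave empty strings at the ends; drop them.
    return [piece for piece in pieces if piece]


sample = "The report was good, but short. Very short."
tokens = tokenize(sample)
print(tokens)
```

Other punctuation (question marks, quotes, brackets) stays attached to the tokens here, since only the three delimiter types named in the assignment are used.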

M2. Merge the tokens from all documents to create a master list of distinct tokens available across all documents. Let us call this the “token population”.
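One way to build the token population is to take the set union of the token lists. The sketch below assumes each document has already been tokenized into a list of strings; `build_token_population` is a hypothetical helper name, and sorting the result is a choice made here so that the attribute order stays stable.

```python
def build_token_population(tokenized_docs):
    """Return the sorted list of distinct tokens across all documents."""
    population = set()
    for tokens in tokenized_docs:
        population.update(tokens)
    # Sorting fixes the attribute order for the feature vectors built later.
    return sorted(population)


# Two tiny made-up documents, already tokenized.
tokenized_docs = [
    ["good", "report", "good"],
    ["short", "report"],
]
token_population = build_token_population(tokenized_docs)
print(token_population)
```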

M3. Load the stop-words using NLTK package. Study these stop-words. What do you think they represent?

(Suggestion: Python users -> from nltk.corpus import stopwords).

M4. Create a “bag-of-words” from the “token population” by removing the stop-words.
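The filtering step in M4 can be sketched as follows. To keep the example runnable without downloading NLTK data, `STOP_WORDS` is a tiny hand-written stand-in and not the NLTK list; with NLTK installed, the set would typically come from `stopwords.words("english")` after `nltk.download("stopwords")`. The helper name `build_bag_of_words` is hypothetical.

```python
# Stand-in stop-word set (an assumption for illustration only).
# With NLTK: STOP_WORDS = set(stopwords.words("english"))
STOP_WORDS = {"the", "was", "but", "a", "is", "very"}


def build_bag_of_words(token_population, stop_words):
    """Keep tokens from the population that are not stop-words, preserving order."""
    return [token for token in token_population if token not in stop_words]


token_population = ["but", "good", "report", "short", "the", "very", "was"]
bag_of_words = build_bag_of_words(token_population, STOP_WORDS)
print(bag_of_words)
```

The NLTK English stop-words are all lower-case, so this comparison only works as intended if the tokens were lower-cased during tokenization.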

M5. For each document / instance, create 2 feature vectors as follows:

o First vector: attributes indicate the presence / absence of tokens from the bag-of-words in that document

o Second vector: attributes indicate the count of occurrences of each word from the bag-of-words in the document

o Do this for all documents. After this is done, you should have 37 vectors each of the first and second kind.
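The two vectors for one document can be derived from a single token count, as in this sketch. It assumes the bag-of-words is an ordered list and the document is a list of tokens; `feature_vectors` is a hypothetical helper name.

```python
from collections import Counter


def feature_vectors(tokens, bag_of_words):
    """Return (binary_vector, count_vector) for one tokenized document."""
    counts = Counter(tokens)
    # Second vector: how many times each bag-of-words token occurs.
    count_vector = [counts[word] for word in bag_of_words]
    # First vector: 1 if the token occurs at all, otherwise 0.
    binary_vector = [1 if count > 0 else 0 for count in count_vector]
    return binary_vector, count_vector


bag_of_words = ["good", "report", "short"]
binary_vec, count_vec = feature_vectors(["good", "report", "good"], bag_of_words)
print(binary_vec, count_vec)
```

Calling this once per document gives the 37 vectors of each kind, all sharing the attribute order of the bag-of-words.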

M6. Calculate the Jaccard Coefficient between the document vectors. Use the first vector for each document for this. The Jaccard coefficient between two binary vectors x and y is defined as:

JC = f11 / (f01 + f10 + f11)

f11 = number of attributes where x was 1 and y was 1

f10 = number of attributes where x was 1 and y was 0

f01 = number of attributes where x was 0 and y was 1
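The formula can be sketched directly by counting attribute pairs. The function name `jaccard` is hypothetical, and returning 0.0 when both vectors are all zeros (an undefined 0/0 case) is an assumed convention, not something the assignment specifies.

```python
def jaccard(x, y):
    """Jaccard coefficient between two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denominator = f01 + f10 + f11
    # Assumed convention: two all-zero vectors get a coefficient of 0.0.
    return f11 / denominator if denominator else 0.0


x = [1, 1, 0]
y = [0, 1, 1]
# Here f11 = 1, f10 = 1, f01 = 1, so JC = 1 / 3.
print(jaccard(x, y))
```

Attributes where both vectors are 0 do not appear in the formula, so tokens absent from both documents do not affect the result.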

 

M7. Calculate the Cosine similarity between the documents by using the second feature vector for each document.
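The assignment does not spell out the cosine formula; the sketch below uses the standard definition, the dot product of the two count vectors divided by the product of their Euclidean lengths. The name `cosine_similarity` and the choice to return 0.0 when either vector is all zeros are assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Assumed convention: similarity with an all-zero vector is 0.0.
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


x = [2, 1, 0]
y = [0, 1, 1]
# dot = 1, |x| = sqrt(5), |y| = sqrt(2), so the similarity is 1 / sqrt(10).
print(cosine_similarity(x, y))
```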

 
