Explain how labeled training data can be used to estimate the probabilities needed by a Hidden Markov Model performing part-of-speech tagging
INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
Section 1
Jan 2016
- What makes part-of-speech tagging non-trivial?
- Every token, including each punctuation mark, gets its own tag. Many words are ambiguous between parts of speech, so the correct tag depends on the context in which the word appears.
- Explain how labeled training data can be used to estimate the probabilities needed by a Hidden Markov Model performing part-of-speech tagging.
- How would you go about performing a quantitative evaluation of a part-of-speech tagger? What method could you use to investigate whether the tagger was often making the same sorts of tagging mistakes?
- Why might part-of-speech tagging be a useful processing step in a system that judges whether a document is expressing mostly opinions or mostly facts?
- What are the similarities and the differences between stemming and lemmatising?
- Stemming and lemmatization both reduce an inflected word to a base form. The difference is that a stem might not be an actual word, whereas a lemma is always a real word of the language. Stemming applies a fixed sequence of rule-based steps to each word, which generally makes it faster.
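The contrast in the answer above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: `SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatise` are hypothetical names invented here, the suffix rules are not those of any published stemmer, and the lookup table stands in for the dictionary a real lemmatiser would consult.

```python
# Toy illustration only: not a published stemming algorithm and not a real lexicon.

# Hypothetical suffix rules, tried in order (longest first).
SUFFIXES = ["ies", "ing", "ed", "es", "s"]


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical hand-built table standing in for a lemmatiser's dictionary.
LEMMA_TABLE = {
    "studies": "study",
    "running": "run",
    "better": "good",
}


def toy_lemmatise(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for w in ["studies", "running", "better"]:
    print(w, toy_stem(w), toy_lemmatise(w))
```

In this sketch the stemmer turns "studies" into "stud" and "running" into "runn", neither of which is a word, while the table lookup returns "study" and "run". The irregular form "better" is untouched by suffix stripping but can be mapped to "good" by a dictionary, at the cost of needing that dictionary in the first place.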
Jan 2017
- Document-level sentiment analysis is a document classification task in which whole documents are assigned positive, negative, and possibly neutral sentiment. Discuss the limitations of treating sentiment analysis as a document-level classification task, i.e. where a single decision is made regarding the entire document as to whether it is positive, negative, or neutral. Illustrate your points with examples.
- Explain the purpose of add-one smoothing in the context of the Naive Bayes method for document classification, and explain as precisely as you can how it has an impact on the way that the parameters of a Naive Bayes classifier are estimated.
- Explain as fully as possible why a Naive Bayes sentiment classifier that is trained on Amazon product reviews from the Books category may not perform well when applied to product reviews from a different product category. Illustrate your points with examples.
- Describe in some detail how you would go about building a Naive Bayes document classifier that is able to determine whether or not articles being received on some general news feed concern some “science and technology” related topic. You should explain how you would determine the accuracy of the classifier you have built in a way that is methodologically sound.
- Naive Bayes classification treats documents as bags of words. Explain what this means, and give an example of a situation where a document classifier is needed, but it would not be effective for the classifier to use a bag-of-words representation for documents.
Jan 2018
- Why is NLP hard? Your answer should present five different aspects of language that make automated language processing challenging.
- You are building a system capable of comparing documents according to topic. As a starting point, you have decided to treat documents as a simple bag of words. Explain what this means and suggest an example of an application where treating a document as a bag of words would not be effective.
- Why might it be important to carry out token canonicalization as part of the preprocessing of the documents? Give examples of where canonicalization might be useful, note two possible approaches to token canonicalization, and explain their similarities and differences.
- After some experimentation, you note that for documents it is important to take account of phrasal terms in order to properly characterize their topic. Why might this be the case and how might you go about identifying suitable phrasal terms? You should illustrate your answer using suitable examples.
- In order to compare documents according to a topic, you require a means of measuring similarity in terms of characteristic terminology. How might you go about detecting characteristic terminology?
Jan 2019
- What is sentence segmentation and why is it a non-trivial task? You should use one or more examples to illustrate your answer.
- Using the sentence below to illustrate your answer, explain what is meant by tokenization and how it might be achieved in practice. Standard Mail rose 6% to £17.90p Tuesday: investors express excitement at half-year profit figures.
- Many languages (for example, Chinese and Japanese) make use of writing systems that do not mark word boundaries with whitespace. Describe an approach to tokenization that might be used in this case.
- Why is accurate tokenization important? You should illustrate your answer with reference to two different downstream processing tasks that depend on accurate tokenization.
- Document pre-processing often involves text normalization. Explain why text normalization is performed and what is involved in each of case normalization, stop word removal, and number normalization.
- Text normalization can sometimes result in a loss of useful information. Illustrate this by describing three different cases where text normalization may lead to problems for downstream processing tasks.
- Explain what is meant by a bag-of-words document representation. Your answer should give an example of a document processing scenario where a bag-of-words representation would be appropriate, and an example where it would not.