Python Programming

In this assignment we will be looking at nucleic acid sequences, which contain up to four different bases denoted by letters.

INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS

Learning Outcomes:

CSC 110 Assignment 7: File I/O (Input/Output)

When you have completed this assignment, you should understand:

  • That well-tested functions that operate on lists work the same no matter the size of the list

  • How to read data from a text file using Python

How to hand it in:

Submit your assignment7.py file through the Assignment 7 link on the CSC110 conneX page.

Grading:

  • Late submissions will be given a zero grade.

  • You must use the py file provided to write your solution. Changing the filename or any of the code given in the file will result in a zero grade.

  • Your function names must match exactly as specified in this document or you will be given a zero grade.

  • Function arguments must be exactly as specified in this document. Specifically, do not change the number of and/or order of the arguments or you will be given a zero grade for the function.

  • We will do spot-check grading in this course. That is, all assignments are graded BUT only a subset of your code might be graded. You will not know which portions of the code will be graded, so all of your code must be complete and adhere to specifications to receive

  • Your code must run without errors on the ECS 258 Lab machines or a zero grade will be given.

  • It is recommended that you use a plain text editor such as Notepad++ as used in the labs or Atom for Mac computers. We also recommend you run your programs through terminal / command prompt, as shown in the

  • It is the responsibility of the student to submit any and all correct files. Only submitted files will be marked. Submitting an incorrect file is not grounds for a

  • If the assignment requires the submission of multiple files, then all files must be submitted.

Marks will be given for…

  • your code producing the correct output

  • the tests for your functions providing sufficient coverage (at least 2 tests and enough to cover all possible paths through the code)

  • your code following good coding conventions (see lab and lecture code for examples)

    • Proper indentation

    • Documentation of signature and purpose using the format we have followed in lectures and previous

    • Names of variables should have meaning relevant to what they are storing

    • Use of whitespace to improve readability

    • Proper use of variables to store intermediate computation results

 

Get started…

  • Download py from the conneX Files tab and save it to your working directory.

  • Write your name and student V# at the top of the file

  • For each of the 9 function specifications provided below you must:

    • Uncomment the calls to each function's test

    • You are free to comment out these test calls in your main function as you progress through the assignment, but you MUST leave all of your tests in place for us to grade

    • Add any tests to the test function that you feel are necessary. The functions have tests provided for you, but you may want to add tests to adequately test each

    • Complete the function definition according to the specification

 

PART 1: Setting the stage (lists and loops):

An important task in bioinformatics is the identification of DNA and RNA sequences. In this assignment we will be looking at nucleic acid sequences. These sequences contain up to four different bases denoted by letters: A for adenine, C for cytosine, G for guanine, and T for thymine.

Sequence strings are compared in order to determine whether nucleic acid sequences match each other, or are related through mutations. Real sequence data as used by biochemists and in bioinformatics research consist of very long strings of A, C, G and T.

The sequences in this assignment will all contain between 2 and 4 of the possible bases (A, C, G, and T). Your task is to search through a collection of sequence data and count how many times a specific sequence occurs. For example, if the collection contains the sequences [ACTG, GATC, ACT, GTC, AC, GATC, GA] and we search for the specific sequence GATC, we would report that it was found 2 times (the second and sixth entries).

One of the difficulties in this assignment will be dealing with mutated sequences. A mutation can occur due to insertions of additional bases within a sequence. For the purpose of this assignment, a mutated sequence contains at least two of the same bases occurring in a row (so in the sequence GAAATC the A has mutated, and in the sequence CCGGAT both the C and G have mutated). Another task in this assignment is to detect how many of the sequences in the collection are mutated.

The final task will be to search through the collection of sequence data for a specific sequence, treating original and mutated sequences the same. For example, if the collection contains [TGC, AC, TTGC, TACG, TGGCC, AGTC] and we search for the specific sequence TGC, we would report that it was found 3 times (because TTGC and TGGCC are mutated forms of TGC).

Exercise 1 – Find the longest string in a given list of strings

Complete the function design for the find_longest() function, which takes a list of strings as a parameter, and returns the longest string found in the list. If there is a tie (two or more strings are tied for the longest in the list), the string found first in the list is returned.
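
One possible approach is sketched below. It is only an illustration, not the course's reference solution: the parameter name, the docstring style, and the choice to return an empty string for an empty list are assumptions, since the handout does not specify them.

```python
def find_longest(strings):
    """(list of str) -> str
    Return the longest string in strings. On a tie, the string that
    appears first in the list is returned.
    Assumption: an empty list yields the empty string.
    """
    longest = ""
    for candidate in strings:
        # Strictly greater, so an earlier string keeps its place on a tie.
        if len(candidate) > len(longest):
            longest = candidate
    return longest


print(find_longest(["ACT", "ACTG", "GATC", "GA"]))  # prints ACTG
```

Using a strict comparison (>) rather than >= is what makes the first of several equally long strings win.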

 

Exercise 2 – Find the number of times a string occurs in a list

Complete the function design for the get_frequency() function, which takes a list of strings and a string as parameters, and returns a count of the number of elements in the list equal to the given string.
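
A minimal sketch of this count, using the example collection from Part 1. The parameter names are assumptions; only the function name and the order of the parameters (list first, then string) come from the specification above.

```python
def get_frequency(sequences, target):
    """(list of str, str) -> int
    Return the number of elements of sequences equal to target.
    """
    count = 0
    for sequence in sequences:
        if sequence == target:
            count += 1
    return count


collection = ["ACTG", "GATC", "ACT", "GTC", "AC", "GATC", "GA"]
print(get_frequency(collection, "GATC"))  # prints 2
```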

 

HINT for Remaining exercises:

The following exercises all involve mutations (read the “Setting the stage” section above). Exercises involving mutations are a little more difficult.

In this assignment, a mutation occurs when two or more characters in a string are repeated in a row. Think about how you might be able to detect a mutation in a string. It is very similar to how we compared two adjacent list elements in a few exercises over the past few weeks. In fact, we can assign a prev variable and use slicing on strings the same way we can on lists.

Exercise 3 – Determining if a sequence is mutated

Complete the function design for the is_mutation() function that takes a string and determines if the string is mutated. For this assignment, a mutation means there is at least one occurrence where a character in the string occurs two or more times in a row. See the hint above.
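
The prev-variable idea from the hint could look like the sketch below. It assumes the function returns a bool and that an empty string counts as not mutated; neither detail is stated in the handout.

```python
def is_mutation(sequence):
    """(str) -> bool
    Return True if any character in sequence is immediately followed
    by the same character, and False otherwise.
    """
    prev = ""  # no previous character before the first one
    for base in sequence:
        if base == prev:
            return True
        prev = base
    return False


print(is_mutation("GAAATC"))  # prints True
print(is_mutation("GATC"))    # prints False
```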

Exercise 4 – Removing mutations from a sequence

Complete the function design for the break_mutation() function that removes all mutations. For this assignment, that means returning a string that has all duplicate letters removed from the given string. Remember, duplicate letters will only occur in a row. For example, “AACTTTG” may occur, as the duplicate A’s and T’s are all in a row, but “ATACTGA” would never occur, because the A’s and T’s are not in a row. See the hint above.
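
A hedged sketch of the same prev-variable technique applied to removal: a character is kept only when it differs from the one before it. The parameter and variable names are illustrative.

```python
def break_mutation(sequence):
    """(str) -> str
    Return sequence with every run of repeated characters collapsed
    to a single character.
    """
    result = ""
    prev = ""
    for base in sequence:
        # Keep a base only when it differs from the one just before it.
        if base != prev:
            result += base
        prev = base
    return result


print(break_mutation("AACTTTG"))  # prints ACTG
print(break_mutation("TGGGGAA"))  # prints TGA
```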

Exercise 5 – Counting the number of mutated sequences

Complete the function design for the count_total_mutations() function that takes a list of strings as a parameter. The function should return a count of the number of strings in the list that are mutated. You should call one of the functions you designed above in your solution!
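
The sketch below shows one way to reuse is_mutation() here. To keep the example runnable on its own, it repeats an assumed version of is_mutation(); in the assignment file you would call your own.

```python
def is_mutation(sequence):
    """(str) -> bool  Assumed helper: True if a character repeats in a row."""
    prev = ""
    for base in sequence:
        if base == prev:
            return True
        prev = base
    return False


def count_total_mutations(sequences):
    """(list of str) -> int
    Return the number of strings in sequences that are mutated.
    """
    count = 0
    for sequence in sequences:
        if is_mutation(sequence):
            count += 1
    return count


collection = ["TGC", "AC", "TTGC", "TACG", "TGGCC", "AGTC"]
print(count_total_mutations(collection))  # prints 2
```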

Exercise 6 – Counting the number of sequences, and mutations of that sequence

Complete the function design for the frequency_incl_mutations() function that takes a list of strings, and a string. The function should return the count of the number of elements in the list that are equal to that string, OR are equal to a mutation of that string! Remember that in this assignment, a mutation occurs when any base is repeated two or more times in a row. Removing the mutations from TGGGGAA would result in TGA. So, for this function, the result of searching for “TGA” in a list containing the elements: [“ACTG”, “TGA”, “TTGA”, “TGGGGAA”, “TGAC”] would result in 3, as TGA, TTGA, and TGGGGAA are all TGA or mutations of it. You should call one of the functions you designed above in your solution!
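
One way to read this specification is sketched below: an element counts if it equals the target outright, or if it equals the target once its mutations are removed. The helper break_mutation() is repeated in an assumed form so the example runs on its own, and the sketch assumes the search string itself is not mutated.

```python
def break_mutation(sequence):
    """(str) -> str  Assumed helper: collapse runs of repeated characters."""
    result = ""
    prev = ""
    for base in sequence:
        if base != prev:
            result += base
        prev = base
    return result


def frequency_incl_mutations(sequences, target):
    """(list of str, str) -> int
    Return how many elements of sequences equal target, either
    directly or after their mutations are removed.
    """
    count = 0
    for sequence in sequences:
        if sequence == target or break_mutation(sequence) == target:
            count += 1
    return count


collection = ["ACTG", "TGA", "TTGA", "TGGGGAA", "TGAC"]
print(frequency_incl_mutations(collection, "TGA"))  # prints 3
```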

 

PART 2: Running your code with large sequences (File IO)!

On the conneX course page in the Files section for Assignment 7, there are a number of text files containing sequence data. Each file contains sequences (strings) separated by spaces. Your task is to read the data from a file and put it into a list. From there you can run each of the functions you designed in Part 1 to obtain some statistics about the input files. Download the 8 data files.

The input files get progressively larger:

  • txt – 5 sequences, without mutations

  • txt – 25 sequences, without mutations

  • txt – 20 sequences, few mutations

  • txt – 50 sequences, few mutations

  • txt – 20 sequences, many mutations

  • txt – 100 sequences, many mutations

  • txt – 1000 sequences, many mutations

  • txt – 10000 sequences, many mutations

 

Exercise 7 – Reading data from a file

Complete the function design for the get_file() function that prompts the user to enter a file name until the user enters the name of a file that can be successfully read from. Try it with data1.txt.
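
The retry loop could be sketched as follows. The handout does not show the function's signature or return value, so this sketch assumes get_file() takes no arguments and returns the opened file object; the prompt and error messages are also made up for illustration.

```python
def get_file():
    """() -> file object
    Prompt for a file name until one can be opened for reading,
    then return the open file.
    Assumption: no parameters; the caller closes the file.
    """
    while True:
        filename = input("Enter a file name: ")
        try:
            return open(filename, "r")
        except OSError:
            # Covers a missing file, a directory name, permission problems.
            print("Could not open", filename, "- please try again.")


# Example use (commented out so the sketch does not wait for input):
# sequence_file = get_file()
# sequence_file.close()
```

Catching OSError rather than only FileNotFoundError is a design choice here: it also handles names that exist but cannot be read.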

Exercise 8 – Creating a list of sequences from a file

Complete the function design for the make_list() function that takes a file object as a parameter, and creates and returns a list of strings containing all of the strings found in the file.
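
A minimal sketch, assuming the sequences are separated by whitespace (spaces or line breaks) as described for the data files. It uses io.StringIO as a stand-in for an opened file so the example runs without any of the course data files.

```python
import io


def make_list(file_obj):
    """(file object) -> list of str
    Return a list of all whitespace-separated strings in file_obj.
    """
    sequences = []
    for line in file_obj:
        # split() with no argument splits on any run of whitespace
        # and discards the trailing newline.
        sequences.extend(line.split())
    return sequences


fake_file = io.StringIO("ACTG GATC ACT\nGTC AC\n")
print(make_list(fake_file))  # prints ['ACTG', 'GATC', 'ACT', 'GTC', 'AC']
```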

Exercise 9 – Analysis of the given test files

Download a7_tester.py and run it. It should call all of the functions you have created so far. The program asks you to enter a file name to read (like data1.txt or data7.txt), and then a sequence to search for. The original handout shows sample output from our solution at this point, with the entered values underlined in red.

On the Connex Discussion Board, post your results when you search for other sequences, and in other text files (like data7.txt). That way you can all compare your results with each other.
