CSI5126. Algorithms in bioinformatics
Fall 2018
Assignment
Deadline: October 22, 2018, 18:00
[ PDF ]
Learning outcomes
- Conceive a linear time algorithm to solve a string problem
- Imagine an exact algorithm counting the number of possible sequence alignments
- Simulate random sequences and analyze the distribution of the scores for the alignment of randomly generated
sequences
1 Exact string matching (5 marks)
Given a linear time algorithm for the exact string matching problem, outline a linear time algorithm which determines if the string
is a circular permutation
of the string ,
i.e. and
have the same
length, consists of a
suffix of followed
by a prefix of .
For example, the string defabc is a circular permutation of abcdef.
- Give a detailed description of the algorithm. For instance, one should be able to write the actual source code given your
description.
- Analyze the time and space requirement of the algorithm.
2 Counting alignments (5 marks)
This question is about the edit distance problem, which consists in finding the minimum number of edit operations that are needed to
transform one input string into the other. We have seen in class that although there is an exponential number of alignments, the
problem can be efficiently solved in quadratic time and space using dynamic programming. Write a program, involving a
dynamic programming solution, to count the number of possible alignments for two given input sequences of length
and
respectively.
- This involves developing a recurrence equation that solves this problem.
- As well as its implementation.
The input for your program is two numbers, say 280 and 240, which are interpreted as the length of two sequences. The output is the
number of possible sequence alignments for two sequences of such lengths. Note: Make sure that your implementation works for
reasonable sizes of sequences, say 280 and 240.
3 Alignment of random sequences (5 marks)
- Write a computer program, involving a dynamic programming solution, to compute the local alignment (Smith-Waterman)
of two DNA sequences of length
and .
- Generate 1,000 pairs of random DNA sequences, of length
and ,
and plot the distribution of the scores.
- Does it look like any distribution that you know of?
You can assume that all four nucleotides are equiprobable, i.e. ,
and that
and . Your
program must find the maximum similarity score, for a local alignment, using the following transition/transversion substitution matrix,
and for
the cost of each indel.
|
|
|
|
|
| A | C | G | T |
|
|
|
|
|
A | 1 | -5 | -1 | -5 |
C | -5 | 1 | -5 | -1 |
G | -1 | -5 | 1 | -5 |
T | -5 | -1 | -5 | 1 |
|
|
|
|
|
|
Submit your program and plot along with your answer to this question.
4 Testing the significance of an alignment (5 marks)
Cases of food poisoning are unfortunately quite common — “Two people are paralyzed after drinking botulism-contaminated carrot
juice (…)” from Botulism-tainted juice paralyzes two in Canada, www.reuters.com, 2006/10/17 — we have been asked to help the
investigators.
Background information. Botulism is a rare disease caused by a toxin (a protein) expressed by the bacterium Clostridium botulinum.
From the above source of information: “Botulism can cause nausea, fatigue, double-vision, paralysis and respiratory failure. In severe
cases, the toxin can be fatal.”
The investigators suspect that at least one of the following proteins, Unknown A or Unknown B, has a similar domain to that of
Clostridium botulinum’s toxin.
- Which protein(s) is (are) related to this toxin?
- Provide a detailed description of the methodology that you have used to arrive at your conclusions.
>gi|27867582 (fragment of the known Clostridium botuninum toxin gene)
GTGAATCAGCACCTGGACTTTCAGATGAAAAATTAAATTTAACTATCCAAAATGATGCTT
ATATACCAAAATATGATTCTAATGGAACAAGTGATATAGAACAACATGATGTTAATGAAC
TTAATGTATTTTTCTATTTAGATGCACAGAAAGTGCCCGAAGGTGAAAATAATGTCAATC
TCACCTCTTCAATTGATACAGCATTATTAGAACAACCTAAAATATATACATTTTTTTCAT
CAGAATTTATTAATAATGTCAATAAACCTGTGCAAGCAGC
> Unknown A
TCTATCAAGTAGATTATTAAATACTACTGCACAAAATGATTCTTACGTTCCAAAATATGA
TTCTAATGGTACAAGTGAAATAAAGGAATATACTGTTGATAAACTAAATGTATTTTTCTA
TTTATATGCACAAAAAGCTCCTGAAGGTGAAAGTGCTATAAGTTTAACTTCTTCAGTTAA
TACAGCATTATTAGATGCATCTAAAGTTTATACGTTTTTTTCTTCAGATTTTATTAATAC
> Unknown B
TCCTGGCTCAGGACGAACGCTGGCGGCGTGTGCTTAACACATGCAAGTCGAGCGATGAAG
CTTCCTTCGGGAAGTGGATTAGCGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCTCA
AAGTGGGGGATAGCCTTCCGAAAGGAAGATTAATACCGCATAACATAAGAGAATCGCATG
ATTTTCTTATCAAAGATTTATTGCTTTGAGATGGACCCGCGGCGCATTAGCTAGTTGGTA
Directives
- You must do this assignment individually.
- Assignments must be submitted through Brightspace (https://uottawa.brightspace.ca).
- Each program must be documented.
- For the programming questions, I should be able to run your programs without downloading additional libraries (note
that I a macOS user).
- Clearly identify yourself on every file that you hand in; this includes your name and student id.
A Frequently Asked Questions [FAQ]
- “Give us an example of input and output for the program that counts the number of alignments.”
The input can be as simple as two numbers,
and ,
which represent the length of the first and second sequence.
> java CountAlignments 10 20
4,354,393,801
- “What programming language can I use for this assignment?”
I should be able to read a wide variety of programming language. I have extensive experience with procedural, object-oriented,
logic, and functional programming languages. However, I would like to be able to run your program on my computer and I am
using macOS Mojave and High Sierra. In doubt, E-mail me.