CSI5126. Algorithms in bioinformatics
Fall 2018

Assignment 1

Deadline: October 1, 2018, 18:00

Learning outcomes

In the workplace, one would use an existing application or API to perform the tasks of this assignment (see the Resources section). However, I believe that writing simple programs yourselves to carry out these tasks can help you learn the biology more easily.

Instructions

For all the questions, assume that the information is stored in FASTA format. I also expect to be able to run your programs from the command line:

$ java A1Q1 input.fa

Here, java refers to the Java Virtual Machine, A1Q1 is the class containing the byte-code of the Java program (A1Q1.java was compiled to produce A1Q1.class), and input.fa is a file containing the input encoded in FASTA format.
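As a minimal sketch of the input step, the class below reads a FASTA file given on the command line, skips the description lines (those starting with ">"), and concatenates the remaining lines into one sequence. The class name FastaSketch and the method name parse are hypothetical, and the sketch assumes the file holds a single record, as in the examples of this assignment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper; the name FastaSketch is not part of the assignment.
class FastaSketch {

    // Concatenates every sequence line, ignoring blank lines and
    // description lines (those starting with '>').
    // Assumption: the input holds a single FASTA record.
    static String parse(List<String> lines) {
        StringBuilder sequence = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith(">")) {
                continue;
            }
            sequence.append(trimmed.toUpperCase());
        }
        return sequence.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java FastaSketch input.fa");
            System.exit(1);
        }
        System.out.println(parse(Files.readAllLines(Paths.get(args[0]))));
    }
}
```

After compiling with javac, it would be run as: java FastaSketch input.fa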

1 Transcription (5 marks)

Write a simple program taking as input a DNA sequence stored in a file. The program must transcribe the input to RNA. The result is written to the standard output. For instance, given a file with the following DNA content:

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC

Your program would display the following information on the output:

ACUGUUGUUCGGUGAUCAUCAGUUGUACAACGUCCUAACAACAUCACAUGCAAUGCUUAUGAUAUUCUUC
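One possible sketch of the transcription step is shown below. Consistent with the example, it treats the input as the coding strand, so every T is simply replaced by U. The class and method names are hypothetical, and reading the FASTA file is left out to keep the example short.

```java
// Hypothetical sketch; the names are illustrative and not part of the assignment.
class TranscriptionSketch {

    // Assumption: the input is the coding strand, so transcription
    // amounts to replacing each thymine (T) by uracil (U).
    static String transcribe(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }

    public static void main(String[] args) {
        // Sequence taken from the example of this question.
        String dna = "ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC";
        System.out.println(transcribe(dna));
    }
}
```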

2 Reverse complement (5 marks)

Write a simple program taking as input a DNA sequence stored in a file. The program must display the reverse complement of the sequence. For instance, given a file with the following DNA content:

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC

Your program would display the following information on the output:

GAAGAATATCATAAGCATTGCATGTGATGTTGTTAGGACGTTGTACAACTGATGATCACCGAACAACAGT
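A minimal sketch of the reverse complement follows: the sequence is traversed from its last base to its first, and each base is replaced by its Watson-Crick partner (A with T, C with G). The class and method names are hypothetical, and mapping any other symbol to N is an assumption of this sketch rather than a requirement of the assignment.

```java
// Hypothetical sketch; the names are illustrative and not part of the assignment.
class ReverseComplementSketch {

    // Returns the complementary base; any symbol other than A, C, G, T
    // is mapped to 'N' (an assumption made for this sketch).
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N';
        }
    }

    // Walks the input from its last base to its first,
    // appending the complement of each base.
    static String reverseComplement(String dna) {
        StringBuilder result = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            result.append(complement(dna.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Sequence taken from the example of this question.
        String dna = "ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC";
        System.out.println(reverseComplement(dna));
    }
}
```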

3 All six reading frames (5 marks)

Write a simple program taking as input a DNA sequence stored in a file. The program must display all six translation reading frames. For example, given the following DNA content:

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC

Your program would produce the following output. Here the star is used to represent the stop codon.

> 5’3’ Frame 1  
T V V R * S S V V Q R P N N I T C N A Y D I L  
 
> 5’3’ Frame 2  
L L F G D H Q L Y N V L T T S H A M L M I F F  
 
> 5’3’ Frame 3  
C C S V I I S C T T S * Q H H M Q C L * Y S  
 
> 3’5’ Frame 1  
E E Y H K H C M * C C * D V V Q L M I T E Q Q  
 
> 3’5’ Frame 2  
K N I I S I A C D V V R T L Y N * * S P N N S  
 
> 3’5’ Frame 3  
R I S * A L H V M L L G R C T T D D H R T T
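The six frames can be sketched as follows: three frames are read from the input strand at offsets 0, 1 and 2, and three more from its reverse complement. The sketch assumes the standard genetic code, stored as a 64-letter string indexed by the codon's bases in the order T, C, A, G. All class and method names are hypothetical, codons containing other symbols are translated to X by assumption, and trailing bases that do not fill a codon are ignored.

```java
// Hypothetical sketch; the names are illustrative and not part of the assignment.
class SixFramesSketch {

    // Standard genetic code; a codon's index is 16*i1 + 4*i2 + i3,
    // where i1, i2, i3 are the positions of its bases in "TCAG".
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    static char translateCodon(String codon) {
        int i1 = BASES.indexOf(codon.charAt(0));
        int i2 = BASES.indexOf(codon.charAt(1));
        int i3 = BASES.indexOf(codon.charAt(2));
        if (i1 < 0 || i2 < 0 || i3 < 0) {
            return 'X'; // assumption: unknown bases give an unknown residue
        }
        return AMINO_ACIDS.charAt(16 * i1 + 4 * i2 + i3);
    }

    // Translates complete codons starting at the given offset (0, 1 or 2).
    static String translate(String dna, int offset) {
        StringBuilder protein = new StringBuilder();
        for (int i = offset; i + 3 <= dna.length(); i += 3) {
            protein.append(translateCodon(dna.substring(i, i + 3)));
        }
        return protein.toString();
    }

    static String reverseComplement(String dna) {
        StringBuilder result = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': result.append('T'); break;
                case 'T': result.append('A'); break;
                case 'C': result.append('G'); break;
                case 'G': result.append('C'); break;
                default:  result.append('N'); break;
            }
        }
        return result.toString();
    }

    // Entries 0..2 are the 5'3' frames, entries 3..5 the 3'5' frames.
    static String[] sixFrames(String dna) {
        String forward = dna.toUpperCase();
        String reverse = reverseComplement(forward);
        String[] frames = new String[6];
        for (int k = 0; k < 3; k++) {
            frames[k] = translate(forward, k);
            frames[k + 3] = translate(reverse, k);
        }
        return frames;
    }

    public static void main(String[] args) {
        // Sequence taken from the example of this question.
        String dna = "ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC";
        String[] frames = sixFrames(dna);
        for (int k = 0; k < 6; k++) {
            String strand = (k < 3) ? "5'3'" : "3'5'";
            System.out.println("> " + strand + " Frame " + (k % 3 + 1));
            // Separate the residues with spaces, as in the sample output.
            System.out.println(String.join(" ", frames[k].split("")));
            System.out.println();
        }
    }
}
```

Storing the code as a single string keeps the sketch short; a map from codon to amino acid would be an equally reasonable choice.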

4 Database search (5 marks)

One of our life science colleagues has just sequenced this DNA fragment. We would like to know if it corresponds to a protein coding sequence. If so, does it match a known protein sequence? To solve this problem, you must translate this DNA sequence into all six possible reading frames, and search each one using the resources available at the National Center for Biotechnology Information (NCBI).

For the online search:

  1. Go to the NCBI Web site: https://www.ncbi.nlm.nih.gov.
  2. Go to the Sequence Analysis section of the Web site (hint: consult the menu on the left-hand side of the page).
  3. In the tools section, you will find a link entitled Basic Local Alignment Search Tool (BLAST). BLAST is a well-known application “[to find] regions of similarity between biological sequences”.
  4. Since our inputs are protein sequences, go to the Protein BLAST Web page.
  5. We will be using the database RefSeq (reference proteins, refseq_proteins), which is a curated database.
  6. For all six reading frames, paste the sequence in the appropriate box and perform a search.

Here is the input DNA sequence.

> Unknown  
ACTGTTGTTCGGTGATCATCAGTTGTACAACGTCCTAACAACATCACATGCAATGCTTATGATATTCTTC  
TTCATCATGCCAGGCACGATGGCAGGACTAGGCAACTTACTAGTGCCATTCCAGATGAGTGTACCGGAGT  
TAGTATTCCCAAAGATTAATAACATCGGTATATGATTTTTAGTATGTGGTCTACTTTTGATTACGGGTTC  
ATCTTGGATGGAGGAAGGTTCAGGAACGGCCTGAACCGTCTATCCACCACTAGCGCTCACTGCAAGTCAT  
AGCGGACTTGCTGTAGATACGTTCATTATCGCATTGCACATGGCCGGTGCAAGCTCCCTTACAGGAAGCA  
TCAACCTTATATGTACAATCGCCTATGCCCGCCGTTCACTCATGGCGATGCTGCAGTCATCACTTTATCC  
CTGATCCATTACAATCACTGCAGCGTTACTCATAGGAGTTGTGCCTGTGCTAGCAGGTGCTATCACGATG  
CTACTCACTGATAGAAGTTGGAGTACCAGCTTCTATGACAGTTCGGCAGGCGGTGATCCTATGTTGTATC  
AGCACTTATTCTGGGTGTTTGGGCATCCAGAAGTCTATATCATCATACTTCCAGTATTCGGTATAGTCAG

Answer the following questions:

5 Genetic Code (5 marks)

Since its discovery 50 years ago, the genetic code has never ceased to amaze. For instance, we now know that biases in codon usage play key roles in the subtle regulation of gene expression [1, 2].

For this question, write a simple program to analyze the genetic code. In particular, your program must output the following information:

Resources

References

[1]   Christina E Brule and Elizabeth J Grayhack. Synonymous Codons: Choose Wisely for Expression. Trends in Genetics, 33(4):283–297, April 2017.

[2]   Tessa E F Quax, Nico J Claassens, Dieter Söll, and John van der Oost. Codon Bias as a Means to Fine-Tune Gene Expression. Molecular Cell, 59(2):149–161, July 2015.

A Frequently Asked Questions [FAQ]

  1. “None.”

    For now!

Modified October 15, 2018