CSI 5180 - Machine Learning for Bioinformatics
Version: Jan 17, 2025 10:51
The holy grail of pandemic preparedness is being able to predict how a virus will evolve just by looking at its genetic sequence. Those days are still a way off, but a growing number of research groups are using artificial intelligence (AI) to predict the evolution of SARS-CoV-2, influenza and other viruses.
The Central Dogma delineates the flow of genetic information: DNA is transcribed into RNA, which is subsequently translated into proteins.
This process involves three primary stages: replication, transcription, and translation.
Given that DNA resides in the nucleus, while protein synthesis occurs in the cytoplasm, the intermediate molecule messenger RNA (mRNA) is essential for conveying genetic information.
Base pair complementarity is crucial for the fidelity of both replication and transcription processes.
Proteins such as helicase, polymerase, and primase are integral to these processes, facilitating DNA unwinding, strand synthesis, and primer formation, respectively.
Transcription initiation of protein-coding genes is regulated by specific DNA sequences known as promoter signals.
This lecture examines the latter stages of the central dogma, focusing on translation and its dependence on the genetic code, codons, and tRNA-mediated amino acid delivery. It then shifts to genome organization, highlighting repetitive DNA and its impact on assembly algorithms and disease. The lecture concludes by outlining proteomic concepts and the intricate networks of biological interactions, emphasizing the multifaceted nature of gene expression and regulation.
General objective
\[ 4^1 < 20 \]
\[ 4^2 < 20 \]
\[ 4^3 > 20 \]
\[ 4^3 = 64 \]
Codons are arranged in contiguous, non-overlapping triplets.
Given the 64 possible codons, the genetic code is degenerate, meaning that multiple codons can correspond to the same amino acid.
DNA: TAC CGC GCC TAT TAC TGC CAG GAA GGA ACT
RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: M A P I M T V L P *
DNA: TAC CGC GCC TAT TAC TGC CAG GAA GGA ACT
RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
Translation involves the ribosome, a riboprotein complex, which works alongside transfer RNA (tRNA) molecules and various regulatory proteins.
These components ensure that tRNAs are charged with the correct amino acids.
Transfer RNAs (tRNAs) function as adaptor molecules crucial for protein synthesis.
They are encoded by non-protein-coding genes within the genome.
These genes undergo transcription to produce RNA, which serves as the final functional product.
Diversity: Bacteria possess approximately 30 to 45 distinct tRNAs, while eukaryotic organisms can have up to 50 distinct types, with humans having 48.
Structure and Function: Each tRNA is covalently bonded to a specific amino acid at one terminus and contains a triplet nucleotide sequence, termed the anti-codon, at the opposite terminus. This sequence is complementary to the mRNA codon.
Notation: tRNAPhe denotes a tRNA specifically charged with phenylalanine, one of the 20 standard amino acids.
Conformation: Typically, tRNAs are 70 to 90 nucleotides in length, adopting a conserved cloverleaf secondary structure. This common structural motif was illustrated on the preceding slide.
Aminoacyl-tRNA synthetases are enzymes that catalyze the attachment of specific amino acids to their corresponding tRNAs. Typically, organisms possess 20 distinct aminoacyl-tRNA synthetases, each dedicated to linking a particular amino acid to all isoaccepting tRNAs—that is, different tRNAs that are charged with the same type of amino acid.
Each tRNA possesses distinct characteristics that ensure its specific aminoacylation with the correct amino acid.
Ribosomes are complex macromolecular structures comprising 3 to 4 RNA molecules and 55 to 83 proteins.
In bacterial cells, ribosome numbers reach approximately 20,000, with eukaryotic cells hosting even more.
Ribosomes facilitate protein synthesis by precisely aligning messenger RNAs (mRNAs), transfer RNAs (tRNAs), and requisite protein factors.
They also play a catalytic role in several biochemical reactions integral to protein synthesis.
DNA: TAC CGC GCC TAT TAC TGC CAG GAA GGA ACT
RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: M A P I M T V L P *
DNA: TAC CGC GCC TAT TAC TGC CAG GAA GGA ACT
RNA: AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
Protein translation initiates at the start codon ATG (AUG in RNA), establishing the reading frame. It terminates at a stop codon.
Typically, proteins begin with methionine; however, alternative start codons such as GUG or UUG may be employed in specific mRNAs. Additionally, post-translational modifications can remove the N-terminal segment of the protein.
There are three stop codons, which are non-sense codons.
Of the 64 possible codons, 61 are sense codons, encoding 20 amino acids, with one codon specifically serving as the start codon, coding for methionine.
The genetic code is described as degenerate because multiple codons can encode the same amino acid. Consequently, a single amino acid sequence may be derived from multiple distinct DNA sequences, ensuring a unique translation.
Video playlists by the Amoeba Sisters, “on a mission to demystify science with humor and relevance.”
Species | Size |
---|---|
Potato spindle tuber viroid (PSTVd) | 360 |
Obelisk | 1,000 |
Human immunodeficiency virus (HIV) | 9,700 |
SARS-CoV-d (COVID-19) | 29,000 |
Bacteriophage lambda (\(\lambda\)) | 48,500 |
Mycoplasma genitalium (bacterium) | 580,000 |
Escherichia coli (bacterium) | 4,600,000 |
Ramazzottius varieornatus (tardigrade) | 55,800,000 |
Drosophila melanogaster (fruit fly) | 120,000,000 |
Homo sapiens (human) | 3,000 000,000 |
Bufo bufo (common toad) | 6,900,000,000 |
Podisma pedestris (mountain grasshopper) | 17,000,000,000 |
Lilium longiflorum (easter lily) | 90,000,000,000 |
Necturus lewisi (a salamander) | 118,000,000,000 |
Amoeba dubia (amoeba) | 670,000,000,000 |
The self-replicating genetic structures of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. In prokaryotes, chromosomal DNA is circular, and the entire genome is carried on one chromosome. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins.
The human genome consists of two primary components: nuclear and mitochondrial genome.
What are the genes?
The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule).
In higher organisms, protein-coding genes are composed of subsegments known as exons, which are interspersed with intervening sequences termed introns.
Contrary to common perception, genomes are not densely packed with genes.
In the human genome, the structure is as follows:
The platypus is a unique, semi-aquatic mammal native to eastern Australia, including Tasmania. It is notable for its unusual combination of features: a duck-bill, webbed feet, and a beaver-like tail. Unlike most mammals, it lays eggs and has venomous spurs on the males’ hind legs. The platypus is one of the few monotremes, a primitive group of egg-laying mammals, and it uses electroreception to locate prey underwater.
Repetitive sequences pose significant challenges to algorithms employed in sequence assembly due to their complex structures and redundancy.
The association of repetitive sequences with various diseases underscores their detection as a critical area of research in bioinformatics.
We investigated the translation phase of the central dogma, emphasizing its reliance on the genetic code, codons, and tRNA-facilitated amino acid transport.
We analyzed genome organization, focusing on repetitive DNA sequences and their implications for assembly algorithms and disease pathogenesis.
In conclusion, we delineated proteomic principles and the complex networks of biological interactions, highlighting the intricate nature of gene expression and regulation.
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa