CSI 5180 - Machine Learning for Bioinformatics
Version: Jan 13, 2025 12:00
General objective
In this lecture, we will explore the cell, including the different types of cells, their organisation, and composition. We will also introduce concepts from molecular evolution. Additionally, we will discuss the macromolecules that make up the cell and their basic structures. Throughout the lecture, we will emphasize the relevance of these concepts to machine learning and bioinformatics.
Tip
With a uOttawa IP address, you can access 300,000 books for free from Springer Nature Link. This includes full downloads in PDF or EPUB formats.
Cells are categorized based on the presence or absence of a nucleus:
Prokaryotes: These cells or organisms lack a membrane-bound, structurally distinct nucleus and other sub-cellular compartments. Bacteria exemplify prokaryotic life forms.
Eukaryotes: These cells or organisms possess a membrane-bound, structurally distinct nucleus along with well-developed sub-cellular compartments. Eukaryotes comprise all cellular organisms except bacteria (including cyanobacteria, formerly called blue-green algae) and archaea. Viruses are not cells and fall outside this classification.
Eukaryotic cells typically exhibit larger dimensions than their prokaryotic counterparts.
In eukaryotic cells, genetic material (DNA) is organised and condensed more intricately than in prokaryotic cells.
Prokarya: Organisms classified as prokaryotes lack a defined nucleus. Exemplary species include Cyanobacteria (blue-green algae) and Escherichia coli (common bacteria).
Eukarya: Eukaryotic organisms possess cells with a well-defined nucleus. Examples include Trypanosoma brucei (a unicellular organism known for causing sleeping sickness) and Homo sapiens (a multicellular organism).
Archaea: Like bacteria, archaea lack a nuclear membrane, but their transcription and translation mechanisms are more akin to those found in eukaryotes.
Methanococcus jannaschii is a methanogenic archaeon, notable for being the first archaeon to have its entire genome sequenced (in 1996). Discovered in 1982, this organism inhabits the extreme environment of a white smoker vent on the Pacific Ocean seabed, 2,600 meters deep. It thrives at temperatures ranging from 48°C to 94°C, with an optimal growth temperature of 85°C. Its genome comprises 1.66 megabases and includes 1,738 genes. Remarkably, 56% of these genes show no homology to those found in eukaryotic or prokaryotic organisms. Additionally, M. jannaschii possesses a single type of DNA polymerase, in contrast to the multiple types typically encoded in other genomes.
“The objectives of phylogenetic studies are (1) to reconstruct the genealogical ties between organisms and (2) to estimate the time of divergence between organisms since they last shared a common ancestor.”
“A phylogenetic tree is a graph composed of nodes and branches, in which only one branch connects any two adjacent nodes.”
“The nodes represent the taxonomic units, and the branches define the relationships among the units in terms of descent and ancestry.”
“The branch length usually represents the number of changes that have occurred in that branch.” (or some amount of time)
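The definition above (nodes as taxonomic units, branches carrying a length) maps naturally onto a small recursive data structure. The following is a minimal sketch, not a reference implementation: the `Node` class, the `leaf_distances` helper, and the toy tree with its names and branch lengths are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A taxonomic unit; branch_length is the branch leading to its parent."""
    name: str
    branch_length: float = 0.0
    children: list["Node"] = field(default_factory=list)


def leaf_distances(node, depth=0.0):
    """Return {leaf name: total branch length from the root to that leaf}."""
    depth += node.branch_length
    if not node.children:          # a leaf: an observed taxonomic unit
        return {node.name: depth}
    distances = {}
    for child in node.children:    # an internal node: a common ancestor
        distances.update(leaf_distances(child, depth))
    return distances


# Hypothetical toy tree; names and branch lengths are invented.
tree = Node("root", 0.0, [
    Node("ancestor_AB", 1.0, [Node("A", 2.0), Node("B", 3.0)]),
    Node("C", 4.5),
])

print(leaf_distances(tree))  # {'A': 3.0, 'B': 4.0, 'C': 4.5}
```

Whether the summed lengths are read as a number of changes or as elapsed time depends on what the branch lengths were chosen to represent.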
Monomers comprise two distinct components: a common segment that forms the molecular backbone shared by all monomers, and a unique segment that determines the monomer’s identity and properties.
ASCII (Unicode) representation of a polymer:
[ ]-[ ]-[ ]-[ ]-[ ]- ... -[ ]-[ ]
 |   |   |   |   |         |   |
 *   @   *   #   +         +   @
We can categorize the structural hierarchy into four distinct levels of abstraction: primary, secondary, tertiary, and quaternary structures.
The primary structure or sequence is an ordered list of characters, from a given alphabet, written contiguously from left to right.
DNA (deoxyribonucleic acid): 4-letter alphabet,
\(\Sigma = \{A,C,G,T\}\)
RNA (ribonucleic acid): 4-letter alphabet,
\(\Sigma = \{A,C,G,U\}\)
Proteins: 20-letter alphabet,
\(\Sigma = \{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y\}\)
In the case of nucleic acids (DNA and RNA), the building blocks are called nucleotides, whilst in the case of proteins they are called amino acids.
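Because a primary structure is simply a string over one of the alphabets above, checking which alphabets a sequence is compatible with takes only a few lines. This is a rough sketch; the function name `compatible_alphabets` is an assumption made for this example, and real data often contains extra symbols (for instance ambiguity codes such as N) that this check would reject.

```python
# The three alphabets listed above, as Python sets.
DNA = set("ACGT")
RNA = set("ACGU")
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")


def compatible_alphabets(sequence):
    """Return the names of the alphabets that contain every symbol of `sequence`."""
    symbols = set(sequence.upper())
    alphabets = {"DNA": DNA, "RNA": RNA, "protein": PROTEIN}
    return [name for name, alphabet in alphabets.items() if symbols <= alphabet]


print(compatible_alphabets("ACGT"))       # ['DNA', 'protein']
print(compatible_alphabets("AUGGUGCAC"))  # ['RNA']
print(compatible_alphabets("MVHLTPEEK"))  # ['protein']
```

The first call shows that the alphabets overlap: A, C, G, and T are also amino acid codes, so the alphabet alone does not always identify the type of molecule.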
Examples of DNA, RNA and protein sequences in FASTA format.
> Chimpanzee Chromosome 1; A DNA sequence (size = 245,522,847 nt)
TAACCCTAACCCTAACCCTAACCCTAACC ... TCTCATGACAGTGAGTGAGTTCTCATGATC
> A01592; An RNA sequence (coding Beta Globin gene) (size = 441 nt)
AUGGUGCACCUGACUCCUGAGGAGAAGUCUGC ... GCAAGGUGAACGUGGAUGAAGUUGGUGGUG
FASTA: This format is the most prevalent for representing the primary structure of proteins and nucleic acids. It consists of a header line starting with ‘>’, followed by lines of sequence data. It’s simple and widely supported by bioinformatics tools.
FASTQ: While primarily used for storing raw sequencing data, FASTQ files include sequence information and a quality score for each nucleotide, which makes the format useful for primary sequence data produced by sequencing technologies.
GenBank: This format, maintained by the National Center for Biotechnology Information (NCBI), is used for nucleotide sequences and includes annotations. It captures the primary sequence along with additional information such as gene features and references.
EMBL: Similar to GenBank, the EMBL format stores nucleotide sequences with detailed annotations. It is used primarily in European databases.
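Of the formats above, FASTA is simple enough to parse by hand: a header line starting with ‘>’ followed by one or more lines of sequence data. The sketch below assumes well-formed input; `parse_fasta` is a name chosen for this example, and the two records are invented.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # emit the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence data may span several lines
    if header is not None:                # emit the last record
        yield header, "".join(chunks)


# Hypothetical records, for illustration only.
example = """\
>seq1 hypothetical DNA record
TAACCCTAACCC
TAACCC
>seq2 hypothetical RNA record
AUGGUGCACCUG
""".splitlines()

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

In practice, an established library such as Biopython (`Bio.SeqIO.parse(handle, "fasta")`) is usually preferable, since it also handles FASTQ, GenBank, and EMBL records.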
Nucleotides consist of a common component composed of a pentose sugar (deoxyribose in DNA, ribose in RNA) and a phosphate group.
The distinguishing feature of each nucleotide is the nitrogenous base.
Nitrogenous bases are categorized as purines, which are larger two-ring structures (adenine, A; guanine, G), and pyrimidines, which are smaller one-ring structures (cytosine, C; thymine, T; uracil, U).
In DNA, the nitrogenous bases include adenine (A), cytosine (C), guanine (G), and thymine (T).
In RNA, the bases are adenine (A), cytosine (C), guanine (G), and uracil (U), where uracil (U) replaces thymine (T).
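At the level of sequences, the relationship between the DNA and RNA alphabets described above amounts to substituting U for T, and the purine/pyrimidine distinction is a simple lookup. The helpers `dna_to_rna` and `base_class` below are names invented for this sketch; they operate on strings only and say nothing about the underlying biochemistry.

```python
PURINES = set("AG")        # two-ring bases
PYRIMIDINES = set("CTU")   # one-ring bases


def dna_to_rna(dna):
    """Rewrite a DNA string in the RNA alphabet by replacing T with U."""
    return dna.upper().replace("T", "U")


def base_class(base):
    """Classify a single base as 'purine' or 'pyrimidine'."""
    base = base.upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"unknown base: {base!r}")


print(dna_to_rna("ATGGTGCAC"))            # AUGGUGCAC
print(base_class("G"), base_class("U"))   # purine pyrimidine
```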
DNA was first identified by Johann Friedrich Miescher in 1869, who initially dismissed its role in heredity.
In 1953, James Watson and Francis Crick proposed the double-helical structure of DNA. (Crick died on July 28, 2004.)
This structural elucidation is widely regarded as the pivotal biological discovery of the 20th century.
Their model provided a molecular basis for Chargaff’s rules, explaining why adenine and thymine occur in equimolar amounts, as do guanine and cytosine.
Crucially, the model revealed the mechanism by which DNA underpins heredity through replication.
Length Measurement in Bases: The length of DNA or RNA molecules is commonly measured in bases. This is a standard unit of measurement for single-stranded nucleic acids. For example, a region 10 megabases long consists of 10 million bases.
Hybridization and Base Pairs: When nucleic acids hybridize, they form a double-stranded structure known as a duplex or double helix. In this context, the length is often measured in base pairs (bp) to reflect the paired nature of the strands. For instance, a region 10 megabase pairs (Mbp) long would consist of 10 million base pairs.
DNA Orientation: The orientation of a DNA molecule is crucial, akin to word order in natural languages, impacting the interpretation and processing of genetic information.
5’ to 3’ Convention: DNA sequences are conventionally read from the 5’ end to the 3’ end. This directionality is essential for subsequent processes, which will be detailed later. Relative to a reference point such as a gene, elements located toward the 5’ end are termed upstream, while those located toward the 3’ end are referred to as downstream.
DNA generally forms a right-handed double helix in the B form, which is the most common form of DNA in cells.
RNA typically forms an A form helix, which is also right-handed.
Z DNA is a known form of DNA that is a left-handed helix.
A DNA molecule is made of two complementary strands running in opposite directions; this arrangement is what is meant by the antiparallel nature of the DNA double helix.
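In DNA, nucleotide bases pair through hydrogen bonding according to specific rules: adenine (A) pairs with thymine (T) through two hydrogen bonds, and guanine (G) pairs with cytosine (C) through three.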
These pairing rules ensure that A:T and G:C pairs align backbone atoms similarly in three-dimensional space, maintaining the uniformity of the double helical structure due to their isosteric nature.
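Taken together, complementarity (A:T and G:C) and the antiparallel arrangement mean that the partner of a strand, written 5’ to 3’, is its reverse complement. The sketch below illustrates this on an arbitrary example strand; `reverse_complement` is a name chosen here, and the input is assumed to contain only A, C, G, and T.

```python
from collections import Counter

# Pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the partner strand, written 5' to 3'."""
    # Complement each base, then reverse because the strands are antiparallel.
    return strand.upper().translate(COMPLEMENT)[::-1]


strand = "ATGGTGCACCTGACT"            # arbitrary example, read 5' to 3'
partner = reverse_complement(strand)
print(partner)                        # AGTCAGGTGCACCAT

# In the duplex as a whole, A and T (and G and C) occur in equal numbers.
counts = Counter(strand + partner)
print(counts["A"] == counts["T"], counts["G"] == counts["C"])  # True True
```

The equal counts in the last line are a direct consequence of the pairing rules and correspond to the equimolarity described by Chargaff’s rules for double-stranded DNA.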
The analogy between natural languages and molecular sequences is profound and extensive.
Since the 1950s, bioinformatics and linguistics have influenced each other.
In bioinformatics, molecular sequences are analyzed using the edit distance (Levenshtein distance), a metric that is also widely used in computational linguistics.
Both fields employ tree structures to represent evolutionary relationships; bioinformaticians construct phylogenetic trees for molecular sequences, while linguists create analogous trees to depict the evolution of languages.
Similarly, hidden Markov models (HMMs), a key machine learning technique, have been adapted from linguistic applications to bioinformatics. HMMs are used for speech recognition and gene prediction.
Currently, concepts such as embeddings, transformers, and large language models are increasingly applied in bioinformatics.
For example, ESM-2 is a transformer-based protein language model, trained on a dataset of 250 million protein sequences, illustrating the continuing exchange of methodologies between these domains.
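To make the edit distance mentioned above concrete, here is a minimal sketch of the classic dynamic-programming computation with unit costs for insertion, deletion, and substitution. The function name `edit_distance` is chosen for this example; sequence alignment tools used in bioinformatics typically rely on more elaborate scoring schemes.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]                              # distance between a[:i] and ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ca
                current[j - 1] + 1,                # insert cb
                previous[j - 1] + (ca != cb),      # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("ACGT", "AGT"))        # 1
```

The same function works unchanged on words and on molecular sequences, which is precisely the kind of overlap between the two fields that the text describes.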
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa