CSI 5180 - Machine Learning for Bioinformatics
Version: Jan 20, 2025 09:11
This lecture explores fundamental molecular biology concepts, focusing on the central dogma, the genetic code, and key biological elements: the genome, transcriptome, proteome, and epigenome. The session highlights the significance and application of these concepts in bioinformatics.
General objective
Personalized Medicine involves tailoring medical treatment to the individual characteristics of each patient, including genetic, environmental, and lifestyle factors.
Genetic and Metabolic Profiling: Utilizing an individual’s genetic makeup and metabolic information for therapeutic decisions is a cornerstone of personalized medicine. This approach can lead to better treatment responses and reduced side effects.
Drug Repurposing: Personalized medicine can facilitate drug repurposing by identifying subgroups of patients who may benefit from a drug that has adverse effects in the general population. This can optimize resource use and reduce drug development costs.
“Dimensionally, genetic data is huge,” explains Kim. “Humans have 3.2 billion DNA characters. If you factor in mutations, there are well over 50 million dimensions, and epigenetics is even bigger. Gene expression and transcription — add another 20,000. So put it all together, we’re easily looking at 100-200 million dimensions.”
Ultimately, Kim would like to be able to use machine learning to draw all of this genetic, epigenetic and medical record and health data together into a single metric space.
“FinnGen is a research project in genomics and personalized medicine. It is a large public-private partnership that has collected and analysed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases.”
Note
The Personal Genome Project UK (PGP-UK) is one of few resources that recruits its participants under open consent and makes the resulting multi-omics data freely and openly available.
DNA, RNA, and proteins are linear sequences of nucleotides or amino acids, representing informational strings.
The Central Dogma of molecular biology elucidates the directional flow of genetic information, delineating the sequential processes by which one macromolecule type dictates the sequence of another.
Fundamentally, this paradigm posits that DNA is transcribed into RNA, which is subsequently translated into protein.
CRICK (1958)
The central dogma states that once “information” has passed into a protein it cannot get out again. The transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein, may be possible, but transfer from protein to protein, or from protein to nucleic acid, is impossible. Information here means the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.
The structure of DNA elucidates the mechanism by which genetic information is faithfully transmitted across generations or from a parent cell to its daughter cells during the process of replication.
Before replication
5' - GATACA -> 3' A
||||||
3' <- CTATGT - 5' B
Generating B’ from A
5' - GATACA -> 3' A
5' - GATACA -> 3' A
||||||
3' <- CTATGT - 5' B'
Generating A’ from B
5' - TGTATC -> 3' B
5' - TGTATC -> 3' B
||||||
3' <- ACATAG - 5' A'
Parent cell AB
5' - GATACA -> 3' A
||||||
3' <- CTATGT - 5' B
Daughter cell AB’
5' - GATACA -> 3' A
||||||
3' <- CTATGT - 5' B'
Daughter cell A’B
5' - TGTATC -> 3' B
||||||
3' <- ACATAG - 5' A'
Daughter cell A’B (redrawn with A’ on top; identical to parent AB)
5' - GATACA -> 3' A'
||||||
3' <- CTATGT - 5' B
The process relies critically on the complementarity of base pairs.
Each DNA strand acts as a template for synthesizing its complementary strand.
This results in two identical double helices, each comprising one parental strand and one newly synthesized strand. This is the semi-conservative model of replication.
DNA replication is facilitated by the enzyme DNA polymerase.
Keep these observations in mind as we proceed with the presentation.
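The observations above can be sketched in a few lines of Python. This is a minimal illustration, not a model of the enzymatic process: the function names `reverse_complement` and `replicate` are hypothetical, and the strands are the toy sequences A and B from the diagrams.

```python
# Watson-Crick base pairing used when copying a DNA strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, also written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def replicate(strand_a: str, strand_b: str):
    """Semi-conservative replication: each parental strand templates a new partner.

    Both strands are given 5' to 3'. Returns the two daughter duplexes,
    each holding one parental strand and one newly synthesized strand.
    """
    new_b = reverse_complement(strand_a)  # B' synthesized from template A
    new_a = reverse_complement(strand_b)  # A' synthesized from template B
    return (strand_a, new_b), (new_a, strand_b)


strand_a = "GATACA"                      # strand A from the diagrams
strand_b = reverse_complement(strand_a)  # strand B, read 5' to 3'
daughter_1, daughter_2 = replicate(strand_a, strand_b)
print(daughter_1, daughter_2)
```

Both daughters come out as the same pair of strands, which is the point of the diagrams: complementarity alone is enough to reconstruct the duplex from either parental strand.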
Li and Graur (1991)
(\(\ldots\)) a gene is a sequence of genomic DNA (\(\ldots\)) that is essential for a specific function.
There are three (3) kinds of genes:
Necessity of an Intermediate Molecule: In eukaryotic cells, the spatial separation between the nucleus, where DNA resides, and the cytoplasm, where protein synthesis occurs, necessitates the existence of an intermediary molecule to facilitate the transfer of genetic information.
Transcription involves a straightforward one-to-one correspondence between each nucleotide in the DNA template and the resulting RNA strand. Specifically:
The resultant molecule is termed a (pre-) messenger RNA or transcript.
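The one-to-one correspondence can be made concrete with a short sketch. It assumes the standard pairing rules (A pairs with U in RNA, T with A, C with G, G with C) and that the template strand is supplied 5' to 3'; the name `transcribe` and the toy sequence are illustrative only.

```python
# Pairing between a DNA template base and the RNA base synthesized opposite it.
RNA_PAIRING = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_5to3: str) -> str:
    """Return the RNA (5' to 3') synthesized from a DNA template given 5' to 3'.

    The template is read 3' to 5', so it is reversed before pairing.
    """
    return "".join(RNA_PAIRING[base] for base in reversed(template_5to3))


template = "TGTATC"          # toy template strand (strand B of the replication example)
mrna = transcribe(template)
print(mrna)
```

The transcript has the same sequence as the non-template (coding) strand, with U in place of T.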
The consensus sequence for the core promoter in E. coli (Escherichia coli) is as follows:
TTGACA(N){16,18}TATAAT
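The consensus above maps almost directly onto a regular expression. The sketch below is one possible way to scan a single strand for exact matches, assuming N means any of A, C, G, T; the function name `find_promoters` and the toy sequence are hypothetical, and real promoters usually deviate from the consensus, so exact matching is a simplification.

```python
import re

# N stands for any nucleotide; the spacer is 16 to 18 bases long.
# The lookahead lets the scan report sites that overlap one another.
PROMOTER = re.compile(r"(?=(TTGACA[ACGT]{16,18}TATAAT))")


def find_promoters(sequence: str):
    """Return (start, matched_text) for each exact consensus match on this strand.

    At most one spacer length is reported per start position.
    """
    return [(m.start(), m.group(1)) for m in PROMOTER.finditer(sequence.upper())]


# Toy sequence: 2 flanking bases, the -35 box, a 17-base spacer, the -10 box.
toy = "CC" + "TTGACA" + "G" * 17 + "TATAAT" + "AA"
hits = find_promoters(toy)
print(hits)
```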
What is the probability of this motif occurring?
The most straightforward model is the independent and identically distributed (i.i.d.) model.
This model assumes two main principles:
Independence: The probability of the entire motif is the product of the probabilities of each nucleotide at its respective position, indicating no interdependence between positions.
Identical Distribution: The probability distribution for nucleotides remains consistent across all positions in the motif.
Typically, maximum likelihood estimators are employed to determine these probability distributions. This involves gathering extensive sample data and using nucleotide frequency as an estimator for probability.
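As a rough sketch of the i.i.d. model, the code below estimates nucleotide probabilities by their observed frequencies (the maximum likelihood estimate) and multiplies them across the fixed positions of a motif. The function names and the ten-base sample are made up for illustration; in practice the sample would be a large collection of genomic sequence.

```python
from collections import Counter
from math import prod


def estimate_frequencies(sample: str) -> dict:
    """Maximum likelihood estimate: relative frequency of each nucleotide."""
    counts = Counter(sample)
    total = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / total for base in "ACGT"}


def iid_probability(motif: str, freqs: dict) -> float:
    """Probability of the motif under the i.i.d. model.

    Independence: multiply per-position probabilities.
    Identical distribution: the same freqs apply at every position.
    """
    return prod(freqs[base] for base in motif)


sample = "AACGTTAACG"                      # toy data: A=4, C=2, G=2, T=2
freqs = estimate_frequencies(sample)
fixed_positions = "TTGACA" + "TATAAT"      # spacer N matches anything, factor 1
p = iid_probability(fixed_positions, freqs)
print(freqs, p)
```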
Consider the canonical promoter motif:
TTGACA(N){16,18}TATAAT
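Assume a uniform nucleotide distribution, with probabilities \(p_A = p_C = p_G = p_T = \frac{1}{4}\). The spacer \((N)\{16,18\}\) matches any nucleotide and contributes a factor of 1, so only the 12 fixed positions matter. Consequently, the probability of observing this motif at a given position, for a given spacer length, is \(\frac{1}{4^{12}} \approx 6 \times 10^{-8}\).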
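The arithmetic can be checked directly, and the same number gives a back-of-the-envelope expected count of chance matches in a genome. The sketch below makes two assumptions that are not in the notes: a genome length of about 4.6 million base pairs (roughly the size of E. coli K-12) and a single fixed spacer length of 17 scanned on one strand. The function names are hypothetical.

```python
def motif_probability(n_fixed: int, p: float = 0.25) -> float:
    """Probability that n_fixed specified positions all match, uniform i.i.d. model."""
    return p ** n_fixed


def expected_occurrences(genome_length: int, motif_length: int, p_motif: float) -> float:
    """Expected number of chance matches on one strand for one spacer length."""
    positions = genome_length - motif_length + 1  # possible start positions
    return positions * p_motif


p_motif = motif_probability(12)            # 6 bases in each of the two boxes

GENOME_LENGTH = 4_600_000                  # assumption: approximate E. coli K-12 size
MOTIF_LENGTH = 6 + 17 + 6                  # assumption: spacer fixed at 17 bases
expected = expected_occurrences(GENOME_LENGTH, MOTIF_LENGTH, p_motif)
print(p_motif, expected)
```

Under these assumptions the expected count per strand and per spacer length is well below one, so an exact consensus match is unlikely to arise by chance alone.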
Estimating Promoter Occurrence in E. coli:
Comparison with Eukaryotic Genomes:
Additional Regulatory Elements:
Walkthrough of the 1 minute 23 seconds animation.
Transcription factors assemble at a promoter region located at the start of a gene. Many promoters are recognised by a characteristic base sequence containing the repetition TATATA, which is why this element is known as the “TATA box”.
In prokaryotes, mRNA is typically degraded within minutes of synthesis, whereas in eukaryotes an mRNA typically persists for several hours.
Regulatory and transport functions are mediated by sequences within the untranslated regions (UTRs) of the mRNA transcript.
Central Dogma
DNA → RNA → Protein as the core framework of gene expression.
Key Omics
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa