Essential Cell Biology (Part 2)

CSI 5180 - Machine Learning for Bioinformatics

Marcel Turcotte

Version: Jan 20, 2025 09:11

Preamble

Quote of the Day

Previous Lecture

Cellular life forms are categorized into two types: prokaryotic and eukaryotic.
Eukaryotic cells possess organelles, with certain organelles, such as mitochondria, containing their own DNA.
The three domains of life are Bacteria, Archaea, and Eukarya.
A phylogeny delineates the evolutionary relationships among organisms and their divergence times.
The primary macromolecules are DNA, RNA, and proteins.
Macromolecules are linear, unbranched polymers, wherein each monomer comprises a common backbone and a distinct, specific component, akin to nodes in a linked chain.

Summary

This lecture explores fundamental molecular biology concepts, focusing on the central dogma, the genetic code, and key biological elements: the genome, transcriptome, proteome, and epigenome. The session highlights the significance and application of these concepts in bioinformatics.

General objective

Describe the central dogma, transcription, translation, and genetic code.

Learning Outcomes

Central Dogma & Molecular Biology
- Summarize the flow of genetic information (DNA → RNA → Protein).
- Distinguish among replication, transcription, and translation.
Key Biological Elements
- Differentiate the genome, transcriptome, proteome, and epigenome.
- Recognize how DNA, RNA, and proteins interrelate in gene expression.
Gene Expression Mechanics
- Identify mRNA, tRNA, rRNA roles and the importance of codons/reading frames.
- Understand promoter regions and the basics of regulatory sequences.

Decoding a Genomic Revolution

Personalized Medicine

Personalized Medicine involves tailoring medical treatment to the individual characteristics of each patient, including genetic, environmental, and lifestyle factors.

Personalized Medicine

Genetic and Metabolic Profiling: Utilizing an individual’s genetic makeup and metabolic information for therapeutic decisions is a cornerstone of personalized medicine. This approach can lead to better treatment responses and reduced side effects.

Personalized Medicine

Drug Repurposing: Personalized medicine can facilitate drug repurposing by identifying subgroups of patients who may benefit from a drug that has adverse effects in the general population. This can optimize resource use and reduce drug development costs.

Personalized Medicine

“Dimensionally, genetic data is huge,” explains Kim. “Humans have 3.2 billion DNA characters. If you factor in mutations, there are well over 50 million dimensions, and epigenetics is even bigger. Gene expression and transcription — add another 20,000. So put it all together, we’re easily looking at 100-200 million dimensions.”

Ultimately, Kim would like to be able to use machine learning to draw all of this genetic, epigenetic and medical record and health data together into a single metric space.

FinnGen

“FinnGen is a research project in genomics and personalized medicine. It is large public-private partnership that has collected and analysed genome and health data from 500,000 Finnish biobank donors to understand the genetic basis of diseases.”

PGP-UK

Note

The Personal Genome Project UK (PGP-UK) is one of few resources that recruits its participants under open consent and makes the resulting multi-omics data freely and openly available.

Proteins

What is a Protein? (from PDB-101)

20 (Naturally Occurring) Amino Acids

Genome

Chromosome Structure

www.wehi.edu.au/wehi-tv

This animation presents the various levels of DNA organization (wrapping) within mitotic chromosomes. Although the histone, nucleosome and chromatin structures were derived from published data, the middle “levels” of wrapping depicted here are controversial as to what form they take or even whether they exist. The top-level coils that create the chromatids in this animation are not usually shown in text-book diagrams, yet published data presents good evidence for their existence and structure as it is presented here.

The histone, nucleosome and DNA models were derived from their PDB (www.rcsb.org/pdb) structures and other published data. The histone’s N-terminal arms were added by hand, as they are not resolved by X-ray crystallography, due to their highly dynamic nature. The time-lapse footage of the mitotic cell was filmed by Prof Jeremy Pickett-Heaps (www.cytographics.com).

Chromatin

Chromatin refers to a mixture of DNA and proteins that form the chromosomes found in the cells of humans and other higher organisms. Many of the proteins, namely histones, package the massive amount of DNA in a genome into a highly compact form that can fit in the cell nucleus.

Chromatin. The total DNA in the cell is about 5 to 6 feet long which has to fit inside the nucleus of a cell in an orderly fashion. DNA molecules first wrap around the histone proteins forming beads on string structure called nucleosomes. Nucleosomes further coil and condense/gather to form fibrous material which is called chromatin. Chromatin fibers can unwind for DNA replication and transcription. When cells replicate, duplicated chromatins condense further to become a lot like chromosomes, visible under microscope which are separated into daughter cells during cell division.

National Human Genome Research Institute

3D Organization of Our Genome (New)

Bioinformaticist’s Perspective

Predicting histone binding sites based solely on DNA sequence data.
Utilizing histone locations to infer gene positions and regulatory element sites.
Exploring the three-dimensional genome organization, a current focal point in research.

Central Dogma

DNA, RNA, and proteins are linear sequences of nucleotides or amino acids, representing informational strings.
The Central Dogma of molecular biology elucidates the directional flow of genetic information, delineating the sequential processes by which one macromolecule type dictates the sequence of another.
Fundamentally, this paradigm posits that DNA is transcribed into RNA, which is subsequently translated into protein.

Central Dogma (1958)

Crick (1958)

The central dogma states that once “information” has passed into a protein it cannot get out again. The transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein, may be possible, but transfer from protein to protein, or from protein to nucleic acid, is impossible. Information here means the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein.

Central Dogma (today)

In our previous lecture, we discussed that retroviruses possess an enzyme called reverse transcriptase, which enables them to transcribe their RNA genome into DNA and integrate it into the host’s genome. Reverse transcription is a hallmark of retroviral replication, as seen in viruses like HIV.

RNA molecules can also serve as templates for synthesizing new RNA strands through the action of RNA-dependent RNA polymerases (RdRps). This process is characteristic of RNA viruses, which depend on RdRps for the replication of their RNA genomes. For example, in the case of the poliovirus, the viral RNA genome functions as a template for the synthesis of complementary RNA via RdRp. The complementary RNA strand subsequently serves as a template for generating new viral genomes, which are then assembled and released to infect more host cells.

In addition to their role in viral replication, RdRps are also involved in the RNA interference pathways of some eukaryotes, notably plants, fungi, and nematodes. Within these pathways, RdRps amplify the silencing signal by synthesizing double-stranded RNA from a target RNA template, in some cases using small interfering RNAs as primers. This amplification contributes to gene regulation and to the cellular defense against viral infections.

Central Dogma (1956)

Central Dogma (DNA)

DNA: Functions as the repository of genetic information, akin to a comprehensive library of programs.

Central Dogma (RNA)

RNA: Serves multiple roles including:
- mRNA: Carries the genetic information transcribed from DNA to the ribosome for protein synthesis.
- tRNA: Acts as an adaptor in protein synthesis.
- rRNA: Integral to ribosomal structure and function.
- Regulatory RNAs: Involved in gene regulation and developmental processes (e.g., microRNAs, riboswitches).

Central Dogma (Proteins)

Proteins: Perform diverse biological functions, including catalysis, signaling, transport, and structural roles.

Replication

Replication (DNA to DNA)

DNA and Heredity

The structure of DNA elucidates the mechanism by which genetic information is faithfully transmitted across generations or from a parent cell to its daughter cells during the process of replication.

DNA and Heredity (Conceptual)

Before replication

5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B

Generating B’ from A

5'  - GATACA -> 3' A


5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B'

DNA and Heredity

Before replication

5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B

Generating A’ from B

5'  - TGTATC -> 3' B


5'  - TGTATC -> 3' B
      ||||||
3' <- ACATAG -  5' A'

DNA and Heredity

Parent cell AB

5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B

Daughter cell AB’

5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B'

Daughter cell A’B

5'  - TGTATC -> 3' B
      ||||||
3' <- ACATAG -  5' A'

DNA and Heredity

Parent cell AB

5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B

Daughter cell AB’

5'  - GATACA -> 3' A
      ||||||
3' <- CTATGT -  5' B'

Daughter cell A’B

5'  - GATACA -> 3' A'
      ||||||
3' <- CTATGT -  5' B

Remarks

Complex organisms develop from a single cell, proliferating into billions of cells. Each cell harbors an essentially identical copy of the DNA from its progenitor (parent) cell.

The redundancy within the DNA double helix allows the second strand’s information to be inferred from the first. This redundancy underpins DNA repair mechanisms, enabling the replacement of deleted bases and the detection of mismatches.
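
As a minimal sketch of how this redundancy can be exploited, the snippet below derives one strand from the other and reports positions where two strands fail to pair. The names `complement` and `find_mismatches` are illustrative and not part of the lecture. Both strands are written left to right as in the diagrams above (A as 5'→3', B as 3'→5').

```python
# Watson-Crick pairing rules for DNA.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)

def find_mismatches(top: str, bottom: str) -> list[int]:
    """Return the 0-based positions where bottom does not pair with top."""
    if len(top) != len(bottom):
        raise ValueError("strands must have the same length")
    return [i for i, (t, b) in enumerate(zip(top, bottom)) if PAIR[t] != b]

strand_a = "GATACA"
strand_b = complement(strand_a)   # inferred entirely from strand A
damaged_b = "CTAGGT"              # hypothetical strand B with one altered base

print(strand_b)
print(find_mismatches(strand_a, damaged_b))
```

Real repair machinery must also decide which of the two mismatched bases is the wrong one; this sketch only locates the disagreement.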

DNA Replication (Basic)

DNA Replication (Advanced)

DNA Replication (Extreme)

Replication: Summary

The process relies critically on the complementarity of base pairs.
Each DNA strand acts as a template for synthesizing its complementary strand.
This results in the formation of two identical double helices, each comprising one parental strand, exemplifying a semi-conservative replication model.
DNA replication is facilitated by the enzyme DNA polymerase.
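
The semi-conservative model summarized above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: a helix is represented as a pair of strings (A written 5'→3', B written 3'→5'), and the function names `synthesize` and `replicate` are hypothetical. It ignores primers, replication forks, and proofreading.

```python
# Watson-Crick pairing rules for DNA.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def synthesize(template: str) -> str:
    """Build the strand that pairs with the template, base by base."""
    return "".join(PAIR[base] for base in template)

def replicate(helix: tuple[str, str]) -> list[tuple[str, str]]:
    """Return two daughter helices, each keeping one parental strand."""
    a, b = helix
    daughter_1 = (a, synthesize(a))   # parental A paired with new B'
    daughter_2 = (synthesize(b), b)   # new A' paired with parental B
    return [daughter_1, daughter_2]

parent = ("GATACA", "CTATGT")
daughters = replicate(parent)
print(daughters)
```

Both daughters carry the same sequence as the parent, and each contains exactly one parental strand.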

Observations

Keep these observations in mind as we proceed with the presentation.

The process of DNA replication is facilitated by various enzymes, including DNA polymerase, primase, ligase, and DNA helicase.
An enzyme is a macromolecule that acts as a catalyst to accelerate specific chemical reactions, with most enzymes being proteins, including those mentioned.
What is the origin of proteins?
How are proteins regulated?

Transcription

Transcription (DNA to RNA)

Transcription (Basic)

Genes

Li and Graur (1991)

(\(\ldots\)) a gene is a sequence of genomic DNA (\(\ldots\)) that is essential for a specific function.

There are three (3) kinds of genes:

Protein-coding genes
RNA-coding genes
Regulatory genes

Transcription: DNA to RNA

Necessity of an Intermediate Molecule: In eukaryotic cells, the spatial separation between the nucleus, where DNA resides, and the cytoplasm, where protein synthesis occurs, necessitates the existence of an intermediary molecule to facilitate the transfer of genetic information.

Transcription: DNA to RNA

Transcription is executed by DNA-dependent RNA polymerase.
It necessitates specific upstream sequences, termed promoter signals, to initiate transcription of protein-coding genes.
In eukaryotic organisms, the initial messenger RNA, or pre-mRNA, includes non-coding regions known as introns, which are excised through intron splicing processes.
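
As a rough illustration of intron removal, the sketch below joins the exon segments of a pre-mRNA given their coordinates. The transcript, the coordinates, and the function name `splice` are all hypothetical. In a real cell the spliceosome recognizes splice sites from sequence signals rather than from a list of positions.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate exon segments given as 0-based, end-exclusive intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Made-up transcript: exons in upper case, one intron in lower case.
pre_mrna = "AUGGCG" + "guaaguuuucag" + "CCGUGA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)
```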

Transcription (continued)

In prokaryotes, gene transcription is mediated by a single RNA polymerase.
In contrast, eukaryotic transcription involves three distinct RNA polymerases: RNA polymerase I transcribes rRNA genes, RNA polymerase II is responsible for protein-coding genes and most small nuclear RNAs (e.g., U1, U2), and RNA polymerase III transcribes small cytoplasmic RNA genes, including tRNA genes, as well as some small nuclear RNAs (e.g., U6).

DNA-RNA Relationship

DNA: ... TAACCTACCGCGCCTATTACTGCCAGGAAGGAACTTGATC ...

DNA: ... TAACCTACCGCGCCTATTACTGCCAGGAAGGAACTTGATC ...
              |||||
RNA:          AUGGC

DNA: ... TAACCTACCGCGCCTATTACTGCCAGGAAGGAACTTGATC ...
              ||||||
RNA:          AUGGCG ...

…

DNA: ... TAACCTACCGCGCCTATTACTGCCAGGAAGGAACTTGATC ...
              ||||||||||||||||||||||||||||||
RNA:          AUGGCGCGGAUAAUGACGGUCCUUCCUUGA

Transcription (continued)

Transcription involves a straightforward one-to-one correspondence between each nucleotide in the DNA template and the resulting RNA strand. Specifically:

G pairs with C;
A pairs with U (instead of T in DNA);
Utilizes ribonucleotides rather than deoxyribonucleotides.

The resultant molecule is termed a (pre-) messenger RNA or transcript.
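
The pairing rules above translate directly into a lookup table. The following is a minimal sketch, assuming the DNA template is written 3'→5' so that the RNA comes out 5'→3'. The function name `transcribe` is illustrative.

```python
# Template DNA base -> RNA base added opposite it (A pairs with U, not T).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template: str) -> str:
    """Return the RNA complementary to a DNA template strand."""
    return "".join(DNA_TO_RNA[base] for base in template)

# First six template bases from the DNA-RNA Relationship slides.
print(transcribe("TACCGC"))
```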

Transcription (continued)

Is the entire genome transcribed?
- No, transcription initiates at specific regions known as promoters.

The consensus sequence for the core promoter in E. coli (Escherichia coli) is as follows:

TTGACA(N){16,18}TATAAT
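
As a rough illustration, the consensus above can be written as a regular expression and used to scan a sequence. The toy sequence and the function name `find_promoters` are made up for this sketch. It scans one strand only and requires an exact match, whereas real promoters usually deviate from the consensus at some positions.

```python
import re

# -35 box, a spacer of 16 to 18 arbitrary bases, then the -10 box.
PROMOTER = re.compile(r"TTGACA[ACGT]{16,18}TATAAT")

def find_promoters(sequence: str) -> list[tuple[int, str]]:
    """Return (start, matched text) for each non-overlapping exact match."""
    return [(m.start(), m.group()) for m in PROMOTER.finditer(sequence.upper())]

# Hypothetical sequence with a 17-base spacer between the two boxes.
toy = "GGC" + "TTGACA" + "ACGTACGTACGTACGTA" + "TATAAT" + "CCG"
print(find_promoters(toy))
```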

What is the probability of this motif occurring?

Transcription (continued)

The most straightforward model is the independent and identically distributed (i.i.d.) model.
This model assumes two main principles:
1. Independence: The probability of the entire motif is the product of the probabilities of each nucleotide at its respective position, indicating no interdependence between positions.
2. Identical Distribution: The probability distribution for nucleotides remains consistent across all positions in the motif.
Typically, maximum likelihood estimators are employed to determine these probability distributions. This involves gathering extensive sample data and using nucleotide frequency as an estimator for probability.
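
A minimal sketch of this estimation procedure follows. The sample string is a placeholder for real genomic data, and the function names `estimate_frequencies` and `motif_probability` are hypothetical.

```python
from collections import Counter
from math import prod

def estimate_frequencies(sample: str) -> dict[str, float]:
    """Maximum likelihood estimate: relative frequency of each nucleotide."""
    counts = Counter(sample)
    total = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / total for base in "ACGT"}

def motif_probability(motif: str, freqs: dict[str, float]) -> float:
    """Probability of the motif at one position under the i.i.d. model."""
    return prod(freqs[base] for base in motif)

# Placeholder sample; a realistic estimate would use a whole genome.
sample = "AACGTTAGCATTAGCA"
freqs = estimate_frequencies(sample)
print(freqs)
print(motif_probability("TATAAT", freqs))
```

Because the model treats positions independently, the motif probability is just a product of per-base frequencies. This makes it cheap to compute but blind to any dependence between neighbouring bases.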

Promoter Prediction

Consider the canonical promoter motif:

TTGACA(N){16,18}TATAAT

Assume uniform nucleotide distribution, with probabilities \(p_A = p_C = p_G = p_T = \frac{1}{4}\). Consequently, the probability of observing the 12 fixed bases of this motif at a given position, for a given spacer length, is \(\frac{1}{4^{12}} \approx 6 \times 10^{-8}\).
Estimating Promoter Occurrence in E. coli:
- Given the E. coli genome size of approximately 4.6 Mb, the expected occurrences of this motif are calculated as \(6 \times 10^{-8} \times 4.6 \times 10^6 \approx 0.276\), suggesting fewer than one occurrence.
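
The arithmetic above can be checked with a few lines of Python. Without rounding the probability to \(6 \times 10^{-8}\), the product comes out at about 0.27. The refinement at the end is an assumption added for illustration, not part of the lecture: it counts the three allowed spacer lengths and both DNA strands as separate candidate sites.

```python
p_motif = 0.25 ** 12     # 12 fixed bases; spacer bases are unconstrained
genome_size = 4.6e6      # approximate E. coli genome size in bases

expected = p_motif * genome_size
print(f"probability per site: {p_motif:.2e}")
print(f"expected occurrences: {expected:.3f}")

# Assumed refinement: 3 spacer lengths (16, 17, 18) and 2 strands.
expected_refined = expected * 3 * 2
print(f"with spacers and both strands: {expected_refined:.2f}")
```

Even with the refinement, the expected count under the uniform i.i.d. model stays on the order of one, so an exact match to the consensus is a rare event by chance.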

Promoter Prediction

Comparison with Eukaryotic Genomes:
- Eukaryotic genomes, with sizes often reaching billions of base pairs, exhibit greater promoter complexity, reflecting intricate regulatory needs.
Additional Regulatory Elements:
- Beyond promoters, other regulatory sequences serve as binding sites for transcriptional regulators, facilitating either transcriptional enhancement (positive regulation) or repression (negative regulation).

Bioinformaticist’s Perspective

The identification of novel regulatory motifs, such as promoters and signaling sequences, remains a vibrant research focus.

About the Animation

Walkthrough of the 1 minute 23 seconds animation.
Transcription factors assemble at a DNA promoter region found at the start of a gene. Promoter regions are characterised by the DNA’s base sequence, which contains the repetition TATATA and for this reason is known as the “TATA box”.

About the Animation

The TATA box is gripped by the transcription factor TFIID (yellow-brown) that marks the attachment point for RNA polymerase and associated transcription factors. In the middle of TFIID is the TATA Binding Protein subunit, which recognises and fastens onto the TATA box. Its tight grip makes the DNA kink 90 degrees, which is thought to serve as a physical landmark for the start of a gene.

About the Animation

A mediator (purple) protein complex arrives carrying the enzyme RNA polymerase II (blue-green). It maneuvers the RNA polymerase into place. Other transcription factors arrive (TFIIA and TFIIB - small blue molecules) and lock into place. Then TFIIH (green) arrives. One of its jobs is to pry apart the two strands of DNA (via helicase action) to allow the RNA polymerase to get access to the DNA bases.

About the Animation

Finally, the initiation complex requires contact with activator proteins, which bind to specific sequences of DNA known as enhancer regions. These regions can be thousands of base pairs away from the initiation complex. The consequent bending of the activator protein/enhancer region into contact with the initiation-complex resembles a scorpion’s tail in this animation.

About the Animation

The activator protein triggers the release of the RNA polymerase, which runs along the DNA transcribing the gene into mRNA (yellow ribbon).

About the Animation

The RNA polymerase unzips a small portion of the DNA helix exposing the bases on each strand. One of the strands acts as a template for the synthesis of an RNA molecule. The base-sequence code is transcribed by matching these DNA bases with RNA subunits, forming a long RNA polymer chain.

Transcription (Detailed)

Transcriptome and Gene Regulation

In prokaryotes, mRNA is typically degraded within minutes of synthesis, whereas eukaryotic mRNAs are generally more stable, with half-lives ranging from minutes to many hours.
Regulatory and transport functions are mediated by sequences within the untranslated regions (UTRs) of the mRNA transcript.

Prologue

Central Dogma (Futuristic Animation)

Summary

Central Dogma
- DNA → RNA → Protein as the core framework of gene expression.
Key Omics
- Genome: Complete genetic blueprint
- Transcriptome: All RNA transcripts
- Proteome: All proteins

Next Lecture

Essential Cell Biology (Part 3)

References

Chervova, Olga, Lucia Conde, José Afonso Guerra-Assunção, Ismail Moghul, Amy P. Webster, Alison Berner, Elizabeth Larose Cadieux, et al. 2019. “The Personal Genome Project-UK, an Open Access Resource of Human Multi-Omics Data.” Scientific Data 6 (1): 257. https://doi.org/10.1038/s41597-019-0205-4.

Cohen, W., and C. Cohen. 2024. A Computer Scientist’s Guide to Cell Biology. Springer Nature Switzerland.

Crick, F. 1970. “Central Dogma of Molecular Biology.” Nature 227 (5258): 561–63. https://doi.org/10.1038/227561a0.

Crick, F. H. 1958. “On Protein Synthesis.” Symp Soc Exp Biol 12: 138–63.

Li, W.-H., and D. Graur. 1991. Fundamentals of Molecular Evolution. Sinauer.

Widłak, Wiesława. 2013. Molecular Biology: Not Only for Bioinformaticians. Vol. 8248. Berlin: Springer. https://doi.org/10.1007/978-3-642-45361-8.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa