Welcome!

CSI 5180 - Machine Learning for Bioinformatics

Marcel Turcotte

Version: Jan 22, 2025 11:37

Preamble

Quote of the Day (1/3)

The Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks”

Quote of the Day (2/3)

Quote of the Day (3/3)

  • “Two papers in this week’s issue dramatically expand our structural understanding of proteins. Researchers at DeepMind, Google’s London-based sister company, present the latest version of their AlphaFold neural network.”

  • Jumper et al. (2021)

Learning Objectives

  • Clarify the proposition of this course
  • Summarize the fundamental concepts of bioinformatics
  • Provide an overview of the instructor’s expertise
  • Discuss the course syllabus in detail
  • Articulate the expectations for student outcomes

Short description

Machine Learning for Bioinformatics Applications is about the analysis of complex biological data using modern machine learning methods.

Prerequisites

While no prior knowledge of machine learning is required, a fundamental grasp of calculus, linear algebra, probability, and statistics is essential.

Additionally, proficiency in Python programming is expected.

What about biology?

  • Bioinformatics strives to solve “real-world” problems.

  • At least two lectures will be devoted to fundamental molecular biology concepts.

  • These foundational concepts will be revisited as new problems arise throughout the course.

  • Participants are expected to demonstrate a keen interest in expanding their biological understanding.

Proposition

AI Detects Mutations Behind Autism

  • “Using artificial intelligence, a Princeton University-led team has decoded the functional impact of such mutations in people with autism.” Press Release

Olga Troyanskaya (Princeton)

AI Detects Mutations Behind Autism

  • “We address the challenge of detecting the contribution of noncoding mutations to disease with a deep-learning-based framework that predicts the specific regulatory effects and the deleterious impact of genetic variants.”

  • “Our predictive genomics framework illuminates the role of noncoding mutations in ASD [autism spectrum disorder] and prioritizes mutations with high impact for further study, and is broadly applicable to complex human diseases.”

  • Zhou et al. (2019)

Large Volumes of Data

  • “Together, the HMP1 and HMP2 phases have produced a total of 42 terabytes of multi-omic data.”

Improving Fitness and Health

“A Brief History of Tomorrow”

Yuval Noah Harari argues that artificial intelligence and genetic engineering will play a central role in shaping the future of society.

Billions on Biotech’s AI Future

About the Course

What this Course is Not

Although the following topics are of paramount importance, they are not what this course is about:

  • Computational Learning Theory:

    • Probably approximately correct learning (PAC Learning)
      proposed by Leslie Valiant;

    • VC theory
      proposed by Vladimir Vapnik and Alexey Chervonenkis;

    • Bayesian inference
      influenced by Judea Pearl;

    • Algorithmic learning theory
      from E. Mark Gold;

    • Online machine learning
      from Nick Littlestone.

  • Compression bounds and learnability in general.

What this Course is

  • Practical applications of machine learning to biological sequence data, gene expression, genomics and proteomics.

Philosophy (1/2)

  • The Hundred-Page Machine Learning Book (Burkov 2019) is a succinct and focused textbook that can feasibly be read in one week, making it an excellent introductory resource.
  • Available under a “read first, buy later” model, allowing readers to evaluate its content before purchasing.
  • Its author, Andriy Burkov, received his Ph.D. in AI from Université Laval.

Philosophy (2/2)

What I Would Like the Course to Be …

  • In future editions of this course:

    • Extensive set of examples

    • Practical Machine Learning Applications in Bioinformatics (textbook)

    • Hackathon, hackfest, codefest, and (friendly) competitive challenges;

    • Participation in international competitions:

  • Activity in the bioGARAGE;

  • Guest lectures.

What I Believe the Course Should Be

Cellular Molecular Biology Problems

  • Predicting protein stability changes upon mutation, intrinsically disordered protein regions

  • Protein secondary and tertiary structure prediction

  • Prediction of anti-hypertensive peptides

  • Genome assembly, gene prediction, genome annotation

  • Identifying DNA landmark sites: methylation sites, splice sites, promoters, protein binding sites, etc.

  • Prediction and prioritization of gene functional annotations.

  • Clustering and classification of non-coding RNA genes

  • Cancer subtype classification

  • Toxicity, carcinogenicity, structure activity relationships

  • Predicting disease associations, identifying robust prognostic gene signatures

  • Sub-cellular localization

Machine Learning Concepts

  • Feature Engineering, Data Imputation, Dimensionality Reduction

  • Linear and Logistic Regression

  • Decision Trees, Random Forests and eXtreme Gradient Boosting, Ensemble

  • Hidden Markov Models

  • Kernel Methods, Support Vector Machines

  • Deep Learning: Fundamentals, Embeddings, Architectures

  • Concept and Rule-based

  • Learning Graphs

  • Unsupervised Learning

  • Semi-supervised Learning

  • Automated Scientific Discovery

Learning Objectives

  • Encode and clean biological data for machine learning applications

  • Apply modern machine learning methods to solve bioinformatics problems

  • Find optimal values for the hyperparameters of a given machine learning algorithm and data set

  • Use a sound methodology for your machine learning projects

  • Critically review scientific publications in this field

  • Locate and critically evaluate scientific information

  • Present scientific content to a small technical audience
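
As a concrete illustration of the first objective (encoding and cleaning biological data), here is a minimal sketch of one-hot encoding for DNA sequences. The function names `clean` and `one_hot` and the choice to map unknown symbols such as N to an all-zero row are assumptions made for this example. They are not a convention prescribed by the course.

```python
# Minimal sketch: one-hot encoding of a DNA sequence (illustrative only).
ALPHABET = "ACGT"  # assumed column order: A, C, G, T


def clean(seq):
    """Remove all whitespace and convert the sequence to upper case."""
    return "".join(seq.split()).upper()


def one_hot(seq):
    """Encode each base as a 4-element row; unknown symbols give all zeros."""
    return [
        [1 if base == symbol else 0 for symbol in ALPHABET]
        for base in clean(seq)
    ]


if __name__ == "__main__":
    for row in one_hot("ac gN"):
        print(row)
```

Each row has exactly one 1 for a known base, so a sequence of length n becomes an n-by-4 matrix that most learning algorithms can consume directly.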

About me

Professional Experience

  • 1989, Honours project, implementation of a graphical user interface for a protein folding/unfolding system

  • 1989–95, Université de Montréal, graduate studies under the direction of Guy Lapalme (IRO) and Robert Cedergren (Biochemistry), work on methods for building 3-D structures of nucleic acids

  • 1995–97, University of Florida, work with Steven A. Benner (Chemistry) on evolutionary-based approaches to predict protein secondary structure

  • 1997–00, Imperial Cancer Research Fund (London/UK), work with Michael J.E. Sternberg and Stephen H. Muggleton (York) on the application of Inductive Logic Programming to automatically discover protein folding rules

  • 2000–, University of Ottawa, work on nucleic acid secondary structure determination, motif inference, and pattern matching

Learning Protein Structure Principles

  • M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. “Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure.” In C.D. Page, editor, Proc. of the 8th International Workshop on Inductive Logic Programming (ILP-98), LNAI 1446, pages 53–64, Berlin, 1998. Springer-Verlag.

  • M.J.E. Sternberg, P.A. Bates, L.A. Kelley, R.M. MacCallum, A. Müller, S. Muggleton, and M. Turcotte. “Exploiting protein structure in the post-genome era.” In Intelligent Systems for Molecular Biology 1999, 1999. Oral Presentation.

  • M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. “Learning protein structure principles.” In The 17th Machine Intelligence Workshop, Suffolk, UK, July 19-21 2000. Oral Presentation.

  • M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. “Generating protein three-dimensional folds signatures using inductive logic programming.” In 2000 Convention of the Society for the Study of Artificial Intelligence and the Simulation of Behaviour, Birmingham, UK, April 17-20 2000. Oral Presentation.

  • Marcel Turcotte, Stephen H. Muggleton, and Michael J.E. Sternberg. “Automated discovery of structural signatures of protein fold and function.” Journal of Molecular Biology, 306(3):591–605, February 2001.

  • Marcel Turcotte, Stephen H. Muggleton, and Michael J.E. Sternberg. “Generating protein three-dimensional fold signatures using inductive logic programming.” Computers & Chemistry, 26(1):57–64, December 2001.

  • Marcel Turcotte, Stephen H. Muggleton, and Michael J.E. Sternberg. “The effect of relational background knowledge on learning of protein three-dimensional fold signatures.” Machine Learning, 43(1–2):81–95, 2001.

Annotation Concept Synthesis

  • Mikhail Jiline, Stan Matwin, and Marcel Turcotte. “Annotation Concept Synthesis and Enrichment Analysis.” Canadian AI 2010: Advances in Artificial Intelligence, 304–308, 2010.

  • Mikhail Jiline, Stan Matwin, and Marcel Turcotte. “Annotation Concept Synthesis and Enrichment Analysis: a Logic-Based Approach to the Interpretation of High-Throughput Experiments.” Bioinformatics (Oxford, England), 27(17):2391–2398, September 2011.

Relationships Between Motifs

  • Oksana Korol and Marcel Turcotte. “Learning relationships between over-represented motifs in a set of DNA sequences.” 2012 IEEE Symposium on Computational Intelligence and Computational Biology, CIBCB 2012, 2012.

Frequent Subgraph Mining (FSM)

  • Alexander R. Gawronski and Marcel Turcotte. “RiboFSM: Frequent subgraph mining for the discovery of RNA structures and interactions.” BMC bioinformatics, 15(S2), 2014.

Smart Controls

  • Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “WACS: Improving peak calling by optimally weighting controls.” In Great Lakes Bioinformatics Conference, GLBIO 2019, May 19–22 2019.

  • Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “WACS: improving ChIP-seq peak calling by optimally weighting controls.” BMC Bioinformatics, 22(1):69, 2021.

Cell-type Specific Binding Signatures

  • Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “Cell type specific binding preferences of transcription factors.” In Great Lakes Bioinformatics Conference, GLBIO 2021, May 10–13 2021.

  • Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “Cell Type Specific DNA Signatures of Transcription Factor Binding.” In Intelligent Systems for Molecular Biology, ISMB 2022, July 10–14 2022.

  • Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “Identifying transcription factors with cell-type specific DNA binding signatures.” BMC Genomics 25, 957 (2024).

RNA Secondary Structure Fingerprints

  • Kevin Sutanto and Marcel Turcotte. “Assessing the use of secondary structure fingerprints and deep learning to classify RNA sequences.” In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, South Korea, December 16-19, 2020.

  • Kevin Sutanto and Marcel Turcotte. “Extracting and evaluating features from RNA virus sequences to predict host species susceptibility using deep learning.” In 13th International Conference on Bioinformatics and Biomedical Technology (ICBBT 2021), Northwestern Polytechnical University, Xi’an, China, May 21-23, 2021.

  • Kevin Sutanto and Marcel Turcotte. “Assessing global-local secondary structure fingerprints to classify RNA sequences with deep learning.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(5):2736–2747, 2023.

What is Bioinformatics?

Beginnings

“Computers and specialized software have become an essential part of the biologist’s toolkit. Either for routine DNA or protein sequence analysis or to parse meaningful information in massive gigabyte-sized biological data sets, virtually all modern research projects in biology require, to some extent, the use of computers. (…) the very beginnings of bioinformatics occurred more than 50 years ago, when desktop computers were still a hypothesis and DNA could not yet be sequenced.”

A. Isaev

“Broadly speaking, bioinformatics can be defined as a collection of mathematical, statistical and computational methods for analyzing biological sequences, that is, DNA, RNA and amino acid (protein) sequences.”

Lacroix and Critchlow

“Bioinformatics is the design and development of computer-based technology that supports life sciences. Using this definition bioinformatics tools and systems perform a diverse range of functions including: data collection, data mining, data analysis, data management, data integration, simulation, statistics, and visualization. Computer-aided technology directly supporting medical applications is excluded from this definition and is referred to as medical informatics.”

Jones N.C. and Pevzner P. A.

“Biologists that reduce bioinformatics to simply the application of computers in biology sometimes fail to recognize the rich intellectual content of bioinformatics. Bioinformatics has become a part of modern biology and often dictates new fashions, enables new approaches, and drives further biological developments.”

J.J. Ramsden

“In bioinformatics, so much is to be done, the raw material to hand is already so vast and vastly increasing, and the problems to be solved are so important (perhaps the most important of any science at present) we may be entering an era comparable to the great flowering of quantum mechanics in the first three decades of the twentieth century (…)”

SIB - Swiss Institute of Bioinformatics

What it’s Not!

Leonard Adleman (Science, December 1994) solved a particular instance of the Hamiltonian Path problem using DNA molecules!

What it’s Not! (continued)

DNA Computing is the theoretical study of the use of DNA molecules to solve challenging problems or as a new architecture (what class of problems can be solved, what are the properties, limits, etc.).

What it’s Not! (continued)

  • Biotechnology and biomedical engineering apply engineering approaches to problems dealing with biological systems.

  • Examples of biomedical engineering include developing biomedical devices for human implantation, drug delivery systems, simulation of organs and micro-fluids, medical imaging, and many more.

Bioinformatics courses on Campus

Collaborative Programs in Bioinformatics

  • Since January 2008, Carleton University and the University of Ottawa have offered a Collaborative Program leading to an MSc degree with Specialization in Bioinformatics or an MSc of Computer Science degree with Specialization in Bioinformatics;

  • Many programs also offer the specialization at the Ph.D. level.

Most Cited Publications in Science

  • Van Noorden, R., Maher, B. & Nuzzo, R. The top 100 papers. Nature 514:550–553, 2014.

  • Wren, J. D. Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades. Bioinformatics 32(17):2686-91, September 2016.

www.bioinformatics.ca/jobs

Syllabus

Course information

Web sites

Schedule

  • Lectures: Tuesday, 11:30 to 12:50, and Friday, 13:00 to 14:20, VNR 2095

  • Office hours: Tuesday from 13:30 to 14:20 at STE 5-106

  • Official schedule: www.uottawa.ca/course-timetable

Course information

Evaluation

  • 20% — assignments (2)
  • 10% — presentation (1)
  • 30% — project (1)
  • 40% — examinations (2)

What is Machine Learning?

The Truth

  • “Let’s start by telling the truth: machines don’t learn. (…) just like artificial intelligence is not intelligence, machine learning is not learning.”

Mitchell

  • “A computer program is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).”
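
To make Mitchell’s definition concrete, here is a minimal sketch with hand-made toy data. The task \(T\) is to label a sequence as GC-rich (1) or not (0) from its GC fraction, the performance measure \(P\) is accuracy on a fixed test set, and the experience \(E\) is the number of labelled training examples. The learner (a threshold at the midpoint of the two class means), the data values, and all names are assumptions chosen for illustration.

```python
# Minimal sketch of Mitchell's definition: performance P (accuracy) at task T
# (GC-rich vs. not) improves with experience E (number of training examples).
# All data are hand-made toy values, not real measurements.


def fit_threshold(examples):
    """Learn a threshold: the midpoint between the two class means."""
    low = [x for x, label in examples if label == 0]
    high = [x for x, label in examples if label == 1]
    return (sum(low) / len(low) + sum(high) / len(high)) / 2


def accuracy(threshold, test_set):
    """Fraction of test examples whose predicted label matches the true one."""
    correct = sum((x > threshold) == bool(label) for x, label in test_set)
    return correct / len(test_set)


# (GC fraction, label) pairs; the first two examples alone are misleading.
TRAIN = [(0.10, 0), (0.60, 1), (0.40, 0), (0.90, 1)]
TEST = [(0.20, 0), (0.30, 0), (0.40, 0), (0.45, 0),
        (0.55, 1), (0.70, 1), (0.80, 1), (0.90, 1)]


def learning_curve(train, test, sizes):
    """Accuracy on the test set after training on the first n examples."""
    return {n: accuracy(fit_threshold(train[:n]), test) for n in sizes}


if __name__ == "__main__":
    print(learning_curve(TRAIN, TEST, sizes=(2, 4)))
```

With only two training examples the learned threshold sits too low and two test sequences are misclassified. With four examples the threshold moves to about 0.5 and every test sequence is classified correctly, which is the improvement with experience that the definition describes.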

1959

  • “A computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.”

  • “Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.”

1951

  • “Inspired by a radio talk given by Turing in 1951, Christopher Strachey went on to implement the world’s first machine learning program.”

ML in Computational Biology

  • “A machine learning algorithm is a computational method based upon statistics, implemented in software, able to discover hidden non-obvious patterns in a dataset, and moreover to make reliable statistical predictions about similar new data.”

  • “The ability [of machine learning] to automatically identify patterns in data […] is particularly important when the expert knowledge is incomplete or inaccurate, when the amount of available data is too large to be handled manually, or when there are exceptions to the general cases.”

Summary

  • Practical experience in applying machine learning techniques to biological datasets.
  • Proficiency in Python programming and a strong interest in biology are essential.

Prologue

An Introduction to the Human Genome

Going from CS to Bioinformatics

Next Lecture

  • Essential Cell Biology

References

Burkov, Andriy. 2019. The Hundred-Page Machine Learning Book. Andriy Burkov.
Chicco, Davide. 2017. “Ten Quick Tips for Machine Learning in Computational Biology.” BioData Mining 10 (1): 35. https://doi.org/10.1186/s13040-017-0155-3.
Integrative HMP (iHMP) Research Network Consortium. 2019. “The Integrative Human Microbiome Project.” Nature 569 (7758): 641–48. https://doi.org/10.1038/s41586-019-1238-8.
Gauthier, Jeff, Antony T Vincent, Steve J Charette, and Nicolas Derome. 2018. “A Brief History of Bioinformatics.” Briefings in Bioinformatics 79 (August): 137. https://doi.org/10.1093/bib/bby063.
Géron, Aurélien. 2022. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd ed. O’Reilly Media, Inc.
Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature, 1–11. https://doi.org/10.1038/s41586-021-03819-2.
Mitchell, Tom M. 1997. Machine Learning. New York: McGraw-Hill.
Muggleton, Stephen. 1994. “Logic and Learning: Turing’s Legacy.” In Machine Intelligence 13, edited by K. Furukawa, D. Michie, and S. H. Muggleton.
Samuel, A. L. 1959. “Some Studies in Machine Learning Using the Game of Checkers.” IBM J. Res. Dev. 3 (3): 210–29. https://doi.org/10.1147/rd.33.0210.
Xu, Chunming, and Scott A Jackson. 2019. “Machine Learning and Complex Biological Data.” Genome Biology 20 (1): 76. https://doi.org/10.1186/s13059-019-1689-0.
Zhou, Jian, Christopher Y Park, Chandra L Theesfeld, Aaron K Wong, Yuan Yuan, Claudia Scheckel, John J Fak, et al. 2019. “Whole-Genome Deep-Learning Analysis Identifies Contribution of Noncoding Mutations to Autism Risk.” Nature Genetics 51 (6): 973–80. https://doi.org/10.1038/s41588-019-0420-0.

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca

School of Electrical Engineering and Computer Science (EECS)

University of Ottawa