CSI 5180 - Machine Learning for Bioinformatics
Version: Jan 22, 2025 11:37
The Nobel Prize in Physics 2024 was awarded to John J. Hopfield and Geoffrey E. Hinton “for foundational discoveries and inventions that enable machine learning with artificial neural networks”
“Two papers in this week’s issue dramatically expand our structural understanding of proteins. Researchers at DeepMind, Google’s London-based sister company, present the latest version of their AlphaFold neural network.”
Jumper et al. (2021)
Machine Learning for Bioinformatics Applications is about the analysis of complex biological data using modern machine learning methods.
While no prior knowledge of machine learning is required, a fundamental grasp of calculus, linear algebra, probability, and statistics is essential.
Additionally, proficiency in Python programming is expected.
Bioinformatics strives to solve “real-world” problems.
At least two lectures will be devoted to fundamental molecular biology concepts.
These foundational concepts will be revisited as new problems arise throughout the course.
Participants are expected to demonstrate a keen interest in expanding their biological understanding.
“We address the challenge of detecting the contribution of noncoding mutations to disease with a deep-learning-based framework that predicts the specific regulatory effects and the deleterious impact of genetic variants.”
“Our predictive genomics framework illuminates the role of noncoding mutations in ASD [autism spectrum disorder] and prioritizes mutations with high impact for further study, and is broadly applicable to complex human diseases.”
Zhou et al. (2019)
“MyExome, a new DNA test designed by Toronto entrepreneur Zaid Shahatit, claims to be able to provide a little insight into our personal quirks by testing 57 different genes that could determine our ability to metabolize certain things, sleep patterns and physical performance.”
Can a DNA test improve your fitness and health? by Christine Sismondo, The Star. July 31, 2019.
Yuval Noah Harari argues that artificial intelligence and genetic engineering will play a central role in shaping the future of society.
Although the following topics are of paramount importance, they are not what this course is about:
Computational Learning Theory:
Probably approximately correct learning (PAC Learning)
proposed by Leslie Valiant;
VC theory
proposed by Vladimir Vapnik and Alexey Chervonenkis;
Bayesian inference
influenced by Judea Pearl;
Algorithmic learning theory
from E. Mark Gold;
Online machine learning
from Nick Littlestone.
Compression bounds and learnability in general.
In future editions of this course:
Extensive set of examples
Practical Machine Learning Applications in Bioinformatics (textbook)
Hackathon, hackfest, codefest, and (friendly) competitive challenges;
Participation in international competitions;
Activity in the bioGARAGE;
Guest lectures.
Predicting protein stability changes upon mutation, intrinsically disordered protein regions
Protein secondary and tertiary structure prediction
Prediction of anti-hypertensive peptides
Genome assembly, gene prediction, genome annotation
Identifying DNA landmark sites: methylation, splice sites, promoters, protein binding sites, etc.
Prediction and prioritization of gene functional annotations.
Clustering and classification of non-coding RNA genes
Cancer subtype classification
Toxicity, carcinogenicity, structure activity relationships
Predicting disease associations, identifying robust prognostic gene signatures
Sub-cellular localization
Feature Engineering, Data Imputation, Dimensionality Reduction
Linear and Logistic Regression
Decision Trees, Random Forests and eXtreme Gradient Boosting, Ensemble
Hidden Markov Models
Kernel Methods, Support Vector Machines
Deep Learning: Fundamentals, Embeddings, Architectures
Concept and Rule-based
Learning Graphs
Unsupervised Learning
Semi-supervised Learning
Automated Scientific Discovery
Encode and clean biological data for machine learning applications
Apply modern machine learning methods to solve bioinformatics problems
Find optimal values for the hyperparameters of a given machine learning algorithm and data set
Use a sound methodology for your machine learning projects
Critically review scientific publications in this field
Locate and critically evaluate scientific information
Present scientific content to a small technical audience
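To make two of the outcomes above concrete (encoding and cleaning biological data, and finding hyperparameter values), here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the course's own method: the function names, the toy sequences, and the choice of a k-nearest-neighbour classifier scored by leave-one-out cross-validation are all hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def clean(seq):
    """Remove whitespace, upper-case, and map any non-ACGT symbol to N."""
    seq = "".join(seq.split()).upper()
    return "".join(c if c in ALPHABET else "N" for c in seq)


def one_hot(seq):
    """Flat one-hot vector, 4 values per base; N becomes four zeros."""
    return [int(c == base) for c in clean(seq) for base in ALPHABET]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda item: sq_dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy of k-NN for one value of the hyperparameter k."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        correct += knn_predict(train, x, k) == y
    return correct / len(data)


def grid_search(data, ks):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: loo_accuracy(data, k) for k in ks}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: mixed-case sequences with made-up labels.
RAW = [("aaaa", "pos"), ("AAAT", "pos"), ("aaac", "pos"),
       ("GGGG", "neg"), ("gggc", "neg"), ("GGGT", "neg")]
DATA = [(one_hot(s), label) for s, label in RAW]
BEST_K, SCORES = grid_search(DATA, ks=[1, 3, 5])
```

On a data set this small the scores carry no statistical weight; the point is only the workflow: clean, encode, then compare candidate hyperparameter values on held-out examples rather than on the training data.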
1989, Honours project, implementation of a graphical user interface for a protein folding/unfolding system
1989–95, Université de Montréal, graduate studies under the direction of Guy Lapalme (IRO) and Robert Cedergren (Biochemistry), work on methods for building nucleic acids’ 3-D structures
1995–97, University of Florida, work with Steven A. Benner (Chemistry) on evolutionary-based approaches to predict protein secondary structure
1997–00, Imperial Cancer Research Fund (London/UK), work with Michael J.E. Sternberg and Stephen H. Muggleton (York) on the application of Inductive Logic Programming to automatically discover protein folding rules
2000–, University of Ottawa, work on nucleic acid secondary structure determination, motif inference, and pattern matching
M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. “Application of inductive logic programming to discover rules governing the three-dimensional topology of protein structure.” In C.D. Page, editor, Proc. of the 8th International Workshop on Inductive Logic Programming (ILP-98), LNAI 1446, pages 53–64, Berlin, 1998. Springer-Verlag.
M.J.E. Sternberg, P.A. Bates, L.A. Kelley, R.M. MacCallum, A. Müller, S. Muggleton, and M. Turcotte. “Exploiting protein structure in the post-genome era.” In Intelligent Systems for Molecular Biology 1999, 1999. Oral Presentation.
M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. “Learning protein structure principles.” In The 17th Machine Intelligence Workshop, Suffolk, UK, July 19-21 2000. Oral Presentation.
M. Turcotte, S.H. Muggleton, and M.J.E. Sternberg. “Generating protein three-dimensional folds signatures using inductive logic programming.” In 2000 Convention of the Society for the Study of Artificial Intelligence and the Simulation of Behaviour, Birmingham, UK, April 17-20 2000. Oral Presentation.
Marcel Turcotte, Stephen H. Muggleton, and Michael J.E. Sternberg. “Automated discovery of structural signatures of protein fold and function.” Journal of Molecular Biology, 306(3):591–605, February 2001.
Marcel Turcotte, Stephen H. Muggleton, and Michael J.E. Sternberg. “Generating protein three-dimensional fold signatures using inductive logic programming.” Computers & Chemistry, 26(1):57–64, December 2001.
Marcel Turcotte, Stephen H. Muggleton, and Michael J.E. Sternberg. “The effect of relational background knowledge on learning of protein three-dimensional fold signatures.” Machine Learning, 43(1–2):81–95, 2001.
Mikhail Jiline, Stan Matwin, and Marcel Turcotte. “Annotation Concept Synthesis and Enrichment Analysis.” Canadian AI 2010: Advances in Artificial Intelligence, 304–308, 2010.
Mikhail Jiline, Stan Matwin, and Marcel Turcotte. “Annotation Concept Synthesis and Enrichment Analysis: a Logic-Based Approach to the Interpretation of High-Throughput Experiments.” Bioinformatics (Oxford, England), 27(17):2391–2398, September 2011.
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “WACS: Improving peak calling by optimally weighting controls.” In Great Lakes Bioinformatics Conference, GLBIO 2019, May 19–22 2019.
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “WACS: improving ChIP-seq peak calling by optimally weighting controls.” BMC Bioinformatics, 22(1):69, 2021.
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “Cell type specific binding preferences of transcription factors.” In Great Lakes Bioinformatics Conference, GLBIO 2021, May 10–13 2021.
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “Cell Type Specific DNA Signatures of Transcription Factor Binding.” In Intelligent Systems for Molecular Biology, ISMB 2022, July 10–14 2022.
Aseel Awdeh, Marcel Turcotte, and Theodore J. Perkins. “Identifying transcription factors with cell-type specific DNA binding signatures.” BMC Genomics 25, 957 (2024).
Kevin Sutanto and Marcel Turcotte. “Assessing the use of secondary structure fingerprints and deep learning to classify RNA sequences.” In IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, South Korea, December 16-19, 2020.
Kevin Sutanto and Marcel Turcotte. “Extracting and evaluating features from RNA virus sequences to predict host species susceptibility using deep learning.” In 13th International Conference on Bioinformatics and Biomedical Technology (ICBBT 2021), Northwestern Polytechnical University, Xi’an, China, May 21-23, 2021.
Kevin Sutanto and Marcel Turcotte. “Assessing global-local secondary structure fingerprints to classify RNA sequences with deep learning.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(5):2736-2747, 2023.
“Computers and specialized software have become an essential part of the biologist’s toolkit. Either for routine DNA or protein sequence analysis or to parse meaningful information in massive gigabyte-sized biological data sets, virtually all modern research projects in biology require, to some extent, the use of computers. (…) the very beginnings of bioinformatics occurred more than 50 years ago, when desktop computers were still a hypothesis and DNA could not yet be sequenced.”
“Broadly speaking, bioinformatics can be defined as a collection of mathematical, statistical and computational methods for analyzing biological sequences, that is, DNA, RNA and amino acid (protein) sequences.”
“Bioinformatics is the design and development of computer-based technology that supports life sciences. Using this definition bioinformatics tools and systems perform a diverse range of functions including: data collection, data mining, data analysis, data management, data integration, simulation, statistics, and visualization. Computer-aided technology directly supporting medical applications is excluded from this definition and is referred to as medical informatics.”
“Biologists that reduce bioinformatics to simply the application of computers in biology sometimes fail to recognize the rich intellectual content of bioinformatics. Bioinformatics has become a part of modern biology and often dictates new fashions, enables new approaches, and drives further biological developments.”
“In bioinformatics, so much is to be done, the raw material to hand is already so vast and vastly increasing, and the problems to be solved are so important (perhaps the most important of any science at present) we may be entering an era comparable to the great flowering of quantum mechanics in the first three decades of the twentieth century (…)”
Leonard Adleman (Science, December 1994) solved a particular instance of the Hamiltonian Path problem using DNA molecules!
DNA Computing is the theoretical study of the use of DNA molecules to solve challenging problems or as a new architecture (what class of problems can be solved, what are the properties, limits, etc.).
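For readers unfamiliar with the Hamiltonian Path problem, the following minimal sketch shows a conventional, in silico version of a generate-and-filter search: enumerate candidate vertex orderings, then keep only those in which every consecutive pair is joined by an edge. The graph, the function name, and the parameters are hypothetical and chosen for illustration; this is not Adleman's graph or his laboratory protocol.

```python
from itertools import permutations


def hamiltonian_paths(edges, n, start, end):
    """Return every directed path from start to end that visits each of
    the n vertices (numbered 0..n-1) exactly once."""
    edge_set = set(edges)
    inner = [v for v in range(n) if v not in (start, end)]
    paths = []
    # Generate: every ordering of the intermediate vertices.
    for middle in permutations(inner):
        path = (start, *middle, end)
        # Filter: keep the ordering only if each step follows an edge.
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            paths.append(path)
    return paths


# Hypothetical directed graph on four vertices.
EDGES = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
PATHS = hamiltonian_paths(EDGES, n=4, start=0, end=3)
```

The number of orderings grows factorially with the number of vertices, which is why the problem is considered challenging and why massively parallel approaches attracted interest.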
Biotechnology and biomedical engineering apply engineering approaches to problems dealing with biological systems.
Examples of biomedical engineering include developing biomedical devices for human implantation, drug delivery systems, simulation of organs and micro-fluids, medical imaging, and many more.
www.bioinformatics.uottawa.ca (32 scientists)
CSI 5126. Algorithms in bioinformatics (2000–2018)
BNF5106 Bioinformatics
BCH5101 Analysis of -omics data
Since January 2008, Carleton University and the University of Ottawa have offered a Collaborative Program leading to an MSc degree with Specialization in Bioinformatics or an MSc of Computer Science degree with Specialization in Bioinformatics.
Many programs also offer the specialization at the Ph.D. level.
Van Noorden, R., Maher, B. & Nuzzo, R. The top 100 papers. Nature 514:550–553, 2014.
Wren, J. D. Bioinformatics programs are 31-fold over-represented among the highest impact scientific papers of the past two decades. Bioinformatics 32(17):2686-91, September 2016.
Lectures: Tuesday, 11:30 to 12:50, and Friday, 13:00 to 14:20, VNR 2095
Office hours: Tuesday from 13:30 to 14:20 at STE 5-106
Official schedule: www.uottawa.ca/course-timetable
“A computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.”
“Programming computers to learn from experience should eventually eliminate the need for much of this detailed programming effort.”
“A machine learning algorithm is a computational method based upon statistics, implemented in software, able to discover hidden non-obvious patterns in a dataset, and moreover to make reliable statistical predictions about similar new data.”
“The ability [of machine learning] to automatically identify patterns in data […] is particularly important when the expert knowledge is incomplete or inaccurate, when the amount of available data is too large to be handled manually, or when there are exceptions to the general cases.”
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa