CSI 5180 - Machine Learning for Bioinformatics
Version: Feb 10, 2025 09:12
The lecture gives an overview of the available resources that are essential for bioinformatics projects. This includes the main databases, software applications, programming languages and computing environments. We also emphasize the skills that are essential to produce robust and reproducible results.
General objective:
You are advised not to hastily install all applications discussed today, as our goal is to review best practices. Only a specific subset of these tools will be necessary for your assignments and projects. Detailed instructions regarding the required tools will be provided for each task.
Nonetheless, proficiency in Jupyter Notebooks and Google Colab is recommended due to their anticipated utility in your coursework.
For those needing a refresher, the official tutorial on Python.org is a good place to start.
Simultaneously enhance your skills by creating a Jupyter Notebook that incorporates examples and notes from the tutorial.
Other resources include:
A notebook is a shareable document that combines computer code, plain language descriptions, data, rich visualizations like 3D models, charts, graphs and figures, and interactive controls. A notebook, along with an editor (like JupyterLab), provides a fast interactive environment for prototyping and explaining code, exploring and visualizing data, and sharing ideas with others.
Assuming the notebook is in the current directory, execute the following command from the terminal.
Similarly, to create a new notebook from scratch,
Ease of Use: The interface is intuitive and conducive to exploratory analysis.
Visualization: The capability to embed rich, interactive visualizations directly within the notebook enhances its utility for data analysis and presentation.
Reproducibility: Jupyter Notebooks have become the de facto standard in many domains for demonstrating code functionality and ensuring reproducibility. Suggested reading: Samuel and Mietchen (2024).
We will employ numerous libraries, such as NumPy, Pandas, Scikit-learn, Keras, TensorFlow or PyTorch, Matplotlib, and Seaborn, among others.
The installation process for these libraries involves dependencies that total around 100 additional packages, potentially causing conflicts with existing projects.
These instructions use pip, the recommended installation tool for Python.
The initial step is to verify that you have a functioning Python installation with pip installed.
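One way to perform this check from within Python itself is sketched below. It relies only on the standard library; the helper name `describe_python_setup` is an illustrative choice and not part of the course material.

```python
import importlib.util
import sys


def describe_python_setup():
    """Return the interpreter version and whether pip can be imported."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    # find_spec returns None when the module is not installed.
    has_pip = importlib.util.find_spec("pip") is not None
    return {"python": version, "pip_available": has_pip}


if __name__ == "__main__":
    print(describe_python_setup())
```

If `pip_available` is reported as false, the Python installation itself should be fixed before attempting to install any library.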
Installing JupyterLab with pip:
Once installed, run JupyterLab with:
Launching 02_interactive_3d_viewert in Colab.
By default, Jupyter Notebooks store the outputs of code cells, including media objects.
Jupyter Notebooks are JSON documents, and images within them are encoded in PNG base64 format.
This encoding can lead to several issues when using version control systems, such as GitHub.
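To make the point concrete, here is a minimal sketch that removes stored outputs from a notebook before committing it. It assumes the nbformat 4 layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count" keys); the example notebook and its fake image payload are invented for illustration.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook with a fake base64 PNG output (illustrative only).
example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["plot()"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "metadata": {},
                    "data": {"image/png": "iVBORw0KGgo" * 1000},
                }
            ],
        },
    ],
}

before = len(json.dumps(example))
after = len(json.dumps(strip_outputs(example)))
print(f"serialized size: {before} -> {after} characters")
```

Stripping outputs keeps diffs small and readable, at the cost of having to re-run the notebook to see its results.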
Important
Do not attempt to install these tools unless you are confident in your technical skills. An incorrect installation could waste significant time or even render your environment unusable. There is nothing wrong with using pip or Google Colab for your coursework. You can develop these installation skills later without impacting your grades.
Package managers, such as conda, facilitate the creation of virtual environments tailored to specific projects.
Anaconda is a comprehensive package management platform for Python and R. It utilizes Conda to manage packages, dependencies, and environments.
Anaconda is advantageous as it comes pre-installed with over 250 popular packages, providing a robust starting point for users.
However, this extensive distribution results in a large file size, which can be a drawback.
Additionally, since Anaconda relies on conda, it also inherits the limitations and issues associated with conda (see subsequent slides).
$ conda create -n csi5180
$ conda create -n csi5180 python=3.10
$ conda install -n csi5180 keras
$ conda activate csi5180
$ conda install bwa
$ conda update --all
$ conda deactivate
$ conda remove --name csi5180 --all
Miniconda is a minimal version of Anaconda that includes only conda, Python, their dependencies, and a small selection of essential packages.
Conda is an open-source package and environment management system for Python and R. It facilitates the installation and management of software packages and the creation of isolated virtual environments.
Dependency conflicts due to complex package interdependencies can force the user to reinstall Anaconda/Conda.
Plagued by large storage requirements and performance issues during package resolution.
Mamba is a reimplementation of the conda package manager in C++.
It is compatible with conda, making it a viable replacement.
Micromamba is a fully statically-linked, self-contained executable. Its empty base environment ensures that the base is never corrupted, eliminating the need for reinstallation.
Both bioinformatics and machine learning favor UNIX.
Quoting François Chollet (Deep Learning with Python):
Modularity
“This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” - Doug McIlroy
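The same idea carries over to Python. The sketch below is a small filter that does one thing: it upper-cases sequence lines in a FASTA text stream and leaves header lines untouched. The function name and the toy input are illustrative assumptions.

```python
import io


def upper_case_sequences(lines):
    """Yield FASTA lines with sequence lines upper-cased; headers unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield line if line.startswith(">") else line.upper()


# Any iterable of lines works: a file, a pipe, or an in-memory stream.
demo = io.StringIO(">seq1\nacgt\nttga\n")
for line in upper_case_sequences(demo):
    print(line)
```

Because the function accepts any iterable of lines, passing it `sys.stdin` would let a script built around it take part in a shell pipeline alongside other tools.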
The file system plays a central role.
/dev/null, /dev/random, /dev/zero
The command line
Shell (anatomy of a script, the magic line, and more)
Annotated/assembled nucleotide sequence
See also: International Nucleotide Sequence Database Collaboration (www.insdc.org)
Each year, NAR, a high-impact journal, publishes its “database issue”:
LOCUS NM_000020 4177 bp mRNA linear PRI 16-SEP-2019
DEFINITION Homo sapiens activin A receptor like type 1 (ACVRL1), transcript
variant 1, mRNA.
ACCESSION NM_000020
VERSION NM_000020.3
KEYWORDS RefSeq; RefSeq Select.
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 4177)
AUTHORS Leng H, Zhang Q and Shi L.
TITLE [Gene diagnosis and treatment of hereditary hemorrhagic
(...)
(...)
FEATURES Location/Qualifiers
source 1..4177
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="12"
/map="12q13.13"
gene 1..4177
/gene="ACVRL1"
/gene_synonym="ACVRLK1; ALK-1; ALK1; HHT; HHT2; ORW2;
SKR3; TSR-I"
/note="activin A receptor like type 1"
/db_xref="GeneID:94"
/db_xref="HGNC:HGNC:175"
/db_xref="MIM:601284"
exon 1..192
/gene="ACVRL1"
/gene_synonym="ACVRLK1; ALK-1; ALK1; HHT; HHT2; ORW2;
(...)
(...)
ORIGIN
1 cccagtcccg ggaggctgcc gcgccagctg cgccgagcga gcccctcccc ggctccagcc
61 cggtccgggg ccgcgcccgg accccagccc gccgtccagc gctggcggtg caactgcggc
121 cgcgcggtgg aggggaggtg gccccggtcc gccgaaggct agcgccccgc cacccgcaga
181 gcgggcccag agggaccatg accttgggct cccccaggaa aggccttctg atgctgctga
241 tggccttggt gacccaggga gaccctgtga agccgtctcg gggcccgctg gtgacctgca
(...)
4081 aaattacact tctcgtacct ggagacgctg tttgtgggag cactgggctc atgcctggca
4141 cacaataggt ctgcaataaa ccatggttaa atcctga
//
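The ORIGIN block of the GenBank record above holds the same bases as a FASTA sequence, only in lower case, in blocks of ten, with a position number at the start of each line. A minimal sketch of the conversion, assuming the record is available as a string (the function name is hypothetical):

```python
def origin_to_sequence(genbank_text):
    """Concatenate the bases listed between ORIGIN and // in a GenBank record."""
    bases = []
    in_origin = False
    for line in genbank_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ORIGIN"):
            in_origin = True
        elif stripped == "//":
            break
        elif in_origin:
            # Keep only alphabetic blocks; this drops the position numbers.
            bases.extend(part for part in stripped.split() if part.isalpha())
    return "".join(bases).upper()


# The last line of the record shown above.
record = """ORIGIN
 4141 cacaataggt ctgcaataaa ccatggttaa atcctga
//"""
print(origin_to_sequence(record))
```

In practice a dedicated parser is preferable; this sketch only illustrates how the two formats relate.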
>NM_000020.3 Homo sapiens activin A receptor like type 1 (ACVRL1), transcript variant 1, mRNA
CCCAGTCCCGGGAGGCTGCCGCGCCAGCTGCGCCGAGCGAGCCCCTCCCCGGCTCCAGCCCGGTCCGGGG
CCGCGCCCGGACCCCAGCCCGCCGTCCAGCGCTGGCGGTGCAACTGCGGCCGCGCGGTGGAGGGGAGGTG
GCCCCGGTCCGCCGAAGGCTAGCGCCCCGCCACCCGCAGAGCGGGCCCAGAGGGACCATGACCTTGGGCT
CCCCCAGGAAAGGCCTTCTGATGCTGCTGATGGCCTTGGTGACCCAGGGAGACCCTGTGAAGCCGTCTCG
GGGCCCGCTGGTGACCTGCACGTGTGAGAGCCCACATTGCAAGGGGCCTACCTGCCGGGGGGCCTGGTGC
ACAGTAGTGCTGGTGCGGGAGGAGGGGAGGCACCCCCAGGAACATCGGGGCTGCGGGAACTTGCACAGGG
AGCTCTGCAGGGGGCGCCCCACCGAGTTCGTCAACCACTACTGCTGCGACAGCCACCTCTGCAACCACAA
CGTGTCCCTGGTGCTGGAGGCCACCCAACCTCCTTCGGAGCAGCCGGGAACAGATGGCCAGCTGGCCCTG
ATCCTGGGCCCCGTGCTGGCCTTGCTGGCCCTGGTGGCCCTGGGTGTCCTGGGCCTGTGGCATGTCCGAC
(...)
GGCCCAATGGCCAGGGAGTGAAGGAGGTGGCGTTGCTGAGAGCAGTCTGCACATGCTTCTGTCTGAGTGC
AGGAAGGTGTTCCAGGGTCGAAATTACACTTCTCGTACCTGGAGACGCTGTTTGTGGGAGCACTGGGCTC
ATGCCTGGCACACAATAGGTCTGCAATAAACCATGGTTAAATCCTGA
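A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such records from any iterable of lines; the function name and the toy records are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


toy = """>toy1 first record
ACGT
ACG
>toy2
TT
"""
for name, sequence in read_fasta(toy.splitlines()):
    print(name, len(sequence))
```

Using a generator means that even a whole-genome file can be processed one record at a time.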
3 columns:
chr7 127471196 127472363
chr7 127472363 127473530
chr7 127473530 127474697
6 columns:
chr1 134212701 134230065 Nuak2 8 +
chr1 134212701 134230065 Nuak2 7 +
chr1 33510655 33726603 Prim2 14 -
chr1 25124320 25886552 Bai3 31 -
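BED coordinates are 0-based and half-open: the start is included and the end is excluded, so the length of an interval is simply end minus start. The following sketch parses the 3-column and 6-column records above; the class and function names are illustrative.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int  # exclusive
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    @property
    def length(self):
        return self.end - self.start


def parse_bed_line(line):
    """Parse one whitespace-delimited BED line with 3 to 6 columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 columns: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


print(parse_bed_line("chr7 127471196 127472363").length)
print(parse_bed_line("chr1 134212701 134230065 Nuak2 8 +"))
```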
“Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF.”
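To illustrate what "set theory on the genome" means, here is a minimal sketch of the intersection of two half-open intervals. It is not how bedtools is implemented, only the underlying arithmetic; the function name is hypothetical.

```python
def overlap(a, b):
    """Return the shared (chrom, start, end) of two intervals, or None."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return None
    start, end = max(start_a, start_b), min(end_a, end_b)
    # With half-open coordinates, abutting intervals share no base.
    return (chrom_a, start, end) if start < end else None


print(overlap(("chr1", 100, 200), ("chr1", 150, 250)))
print(overlap(("chr7", 127471196, 127472363), ("chr7", 127472363, 127473530)))
```

For real data sets, bedtools also handles sorting, strand, and the various file formats, which this sketch ignores.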
Installing twoBitToFa.
Downloading the mouse genome (assembly 9).
Given genes.bed:
chr1 134212701 134230065 Nuak2 8 +
chr1 134212701 134230065 Nuak2 7 +
chr1 33510655 33726603 Prim2 14 -
chr1 25124320 25886552 Bai3 31 -
chr1 134210701 134212701 Nuak2 8 +
chr1 134210701 134212701 Nuak2 7 +
chr1 33726603 33728603 Prim2 14 -
chr1 25886552 25888552 Bai3 31 -
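The last four intervals are consistent with the 2,000 bp immediately upstream of each gene, taking the strand into account: before the start on the "+" strand, after the end on the "-" strand. A minimal sketch of that computation follows; the function name is an assumption, and the sketch does not clip against chromosome sizes.

```python
def upstream_flank(chrom, start, end, strand, size=2000):
    """Return the interval of `size` bases upstream of a stranded feature."""
    if strand == "+":
        # Upstream lies before the start; never go below position 0.
        return chrom, max(0, start - size), start
    if strand == "-":
        # On the reverse strand, upstream lies after the end coordinate.
        return chrom, end, end + size
    raise ValueError(f"unknown strand: {strand!r}")


print(upstream_flank("chr1", 134212701, 134230065, "+"))
print(upstream_flank("chr1", 33510655, 33726603, "-"))
```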
>chr1:134210701-134212701
TTCTGGCACTTGGTTGTTCT...GTTTTATAGCAATTCGGAAC
>chr1:134210701-134212701
TTCTGGCACTTGGTTGTTCT...GTTTTATAGCAATTCGGAAC
>chr1:33726603-33728603
TCTCCCAGTGGCGGGAGAGT...ATTTATTTTTATGTTTATAA
>chr1:25886552-25888552
TTGCGCCTTATCCAAGTGAA...TCCCAGGAACAAATCACCAG
01_get_data.sh
Avoid /tmp/: this is temporary storage for the operating system, and sometimes the partition is rather small.
Prefer /var/tmp/ or a designated space, such as /scratch.
#!/bin/bash
# Sample Bash script to download a genome and extract information
INPUT=genes.bed
if [ ! -f "$INPUT" ]; then
  echo "file not found: $INPUT"
  exit 1
fi
PROJECT=csi5180-demo
# Time stamp and process ID as suffix
TMP_DIR="/var/tmp/$PROJECT-$(date +"%FT%H%M%S")-$$"
if [ -d "$TMP_DIR" ]; then
  echo "$TMP_DIR exists!"
  exit 1
fi
# Creating the temporary directory
mkdir "$TMP_DIR"
# The URL where the mouse genome version 9 (MM9) can be found
MM9_URL=http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit
# Where to save the mouse genome as a FASTA file
MM9_FILE_NAME="$TMP_DIR/mm9.fa"
# Download and uncompress the genome
twoBitToFa -udcDir="$TMP_DIR" "$MM9_URL" stdout > "$MM9_FILE_NAME"
# URL of the file containing the size of each chromosome
MM9_SIZE_URL=http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/mm9.chrom.sizes
MM9_SIZE_FILE_NAME="$TMP_DIR/mm9.chromsizes"
# Downloading the size file (to the temporary directory)
curl "$MM9_SIZE_URL" > "$MM9_SIZE_FILE_NAME"
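The script builds a unique directory name from a time stamp and the process ID. In Python, the standard library offers `tempfile.mkdtemp` for the same purpose. The sketch below is one possible equivalent, not a translation of the script; the helper name `make_work_dir` is hypothetical.

```python
import os
import tempfile


def make_work_dir(project, base=None):
    """Create and return a new, uniquely named working directory."""
    base = base or tempfile.gettempdir()
    # mkdtemp picks an unused name and creates the directory itself.
    return tempfile.mkdtemp(prefix=f"{project}-", dir=base)


work_dir = make_work_dir("csi5180-demo")
print(work_dir)
os.rmdir(work_dir)  # clean up the empty demo directory
```

Passing `base="/var/tmp"` or a scratch directory follows the advice given above about avoiding /tmp/ for large files.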
Representational State Transfer (REST)
>ENST00000288602.11
CCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCG
GCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCCCGAGGCCGGCGCCGGCG
CCGGCGCCGCGGCCTCTTCGGCTGCGGACCCTGCCATTCCGGAGGAGGTGTGGAATATCA
AACAAATGATTAAGTTGACACAGGAACATATAGAGGCCCTATTGGACAAATTTGGTGGGG
AGCATAATCCACCATCAATATATCTGGAGGCCTATGAAGAATACACCAGCAAGCTAGATG
CACTCCAACAAAGAGAACAACAGTTATTGGAATCTCTGGGGAACGGAACTGATTTTTCTG
TTTCTAGCTCTGCATCAATGGATACCGTTACATCTTCTTCCTCTTCTAGCCTTTCAGTGC
TACCTTCATCTCTTTCAGTTTTTCAAAATCCCACAGATGTGGCACGGAGCAACCCCAAGT
CACCACAAAAACCTATCGTTAGAGTCTTCCTGCCCAACAAACAGAGGACAGTGGTACCTG
CAAGGTGTGGAGTTACAGTCCGAGACAGTCTAAAGAAAGCACTGATGATGAGAGGTCTAA
TCCCAGAGTGCTGTGCTGTTTACAGAATTCAGGATGGAGAGAAGAAACCAATTGGTTGGG
ACACTGATATTTCCTGGCTTACTGGAGAAGAATTGCATGTGGAAGTGTTGGAGAATGTTC
CACTTACAACACACAACTTTGTACGAAAAACGTTTTTCACCTTAGCATTTTGTGACTTTT
GTCGAAAGCTGCTTTTCCAGGGTTTCCGCTGTCAAACATGTGGTTATAAATTTCACCAGC
GTTGTAGTACAGAAGTTCCACTGATGTGTGTTAATTATGACCAACTTGATTTGCTGTTTG
TCTCCAAGTTCTTTGAACACCACCCAATACCACAGGAAGAGGCGTCCTTAGCAGAGACTG
CCCTAACATCTGGATCATCCCCTTCCGCACCCGCCTCGGACTCTATTGGGCCCCAAATTC
TCACCAGTCCGTCTCCTTCAAAATCCATTCCAATTCCACAGCCCTTCCGACCAGCAGATG
AAGATCATCGAAATCAATTTGGGCAACGAGACCGATCCTCATCAGCTCCCAATGTGCATA
TAAACACAATAGAACCTGTCAATATTGATGACTTGATTAGAGACCAAGGATTTCGTGGTG
ATGGAGCCCCTTTGAACCAGCTGATGCGCTGTCTTCGGAAATACCAATCCCGGACTCCCA
GTCCCCTCCTACATTCTGTCCCCAGTGAAATAGTGTTTGATTTTGAGCCTGGCCCAGTGT
TCAGAGGATCAACCACAGGTTTGTCTGCTACCCCCCCTGCCTCATTACCTGGCTCACTAA
CTAACGTGAAAGCCTTACAGAAATCTCCAGGACCTCAGCGAGAAAGGAAGTCATCTTCAT
CCTCAGAAGACAGGAATCGAATGAAAACACTTGGTAGACGGGACTCGAGTGATGATTGGG
AGATTCCTGATGGGCAGATTACAGTGGGACAAAGAATTGGATCTGGATCATTTGGAACAG
TCTACAAGGGAAAGTGGCATGGTGATGTGGCAGTGAAAATGTTGAATGTGACAGCACCTA
CACCTCAGCAGTTACAAGCCTTCAAAAATGAAGTAGGAGTACTCAGGAAAACACGACATG
TGAATATCCTACTCTTCATGGGCTATTCCACAAAGCCACAACTGGCTATTGTTACCCAGT
GGTGTGAGGGCTCCAGCTTGTATCACCATCTCCATATCATTGAGACCAAATTTGAGATGA
TCAAACTTATAGATATTGCACGACAGACTGCACAGGGCATGGATTACTTACACGCCAAGT
CAATCATCCACAGAGACCTCAAGAGTAATAATATATTTCTTCATGAAGACCTCACAGTAA
AAATAGGTGATTTTGGTCTAGCTACAGTGAAATCTCGATGGAGTGGGTCCCATCAGTTTG
AACAGTTGTCTGGATCCATTTTGTGGATGGCACCAGAAGTCATCAGAATGCAAGATAAAA
ATCCATACAGCTTTCAGTCAGATGTATATGCATTTGGAATTGTTCTGTATGAATTGATGA
CTGGACAGTTACCTTATTCAAACATCAACAACAGGGACCAGATAATTTTTATGGTGGGAC
GAGGATACCTGTCTCCAGATCTCAGTAAGGTACGGAGTAACTGTCCAAAAGCCATGAAGA
GATTAATGGCAGAGTGCCTCAAAAAGAAAAGAGATGAGAGACCACTCTTTCCCCAAATTC
TCGCCTCTATTGAGCTGCTGGCCCGCTCATTGCCAAAAATTCACCGCAGTGCATCAGAAC
CCTCCTTGAATCGGGCTGGTTTCCAAACAGAGGATTTTAGTCTATATGCTTGTGCTTCTC
CAAAAACACCCATCCAGGCAGGGGGATATGGTGCGTTTCCTGTCCACTGAAACAAATGAG
TGAGAGAGTTCAGGAGAGTAGCAACAAAAGGAAAATAAATGAACATATGTTTGCTTATAT
GTTAAATTGAATAAAATACTCTCTTTTTTTTTAAGGTGAAC
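A FASTA record such as the one above can be retrieved programmatically through a REST interface. The sketch below assumes the public Ensembl REST server at rest.ensembl.org and its `sequence/id` endpoint with a `type` parameter; the choice of `cdna`, the unversioned identifier, and the function names are all assumptions rather than the command used in the lecture.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"  # assumed public server


def sequence_url(stable_id, seq_type="cdna"):
    """Build the URL of the sequence resource for a stable identifier."""
    query = urlencode({"type": seq_type})
    return f"{ENSEMBL_REST}/sequence/id/{quote(stable_id)}?{query}"


def fetch_fasta(stable_id, seq_type="cdna", timeout=30):
    """Request the sequence in FASTA format (requires network access)."""
    request = Request(
        sequence_url(stable_id, seq_type),
        headers={"Content-Type": "text/x-fasta"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(sequence_url("ENST00000288602"))
```

Calling `fetch_fasta("ENST00000288602")` would perform the actual request. The resource is identified by its URL and the representation (here FASTA) is selected through a header, which is the essence of the REST style.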
A Python script can also be made executable.
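As a sketch, the script below starts with a "magic line" that tells the shell which interpreter to use. The file name gc_content.py and the task (reporting GC content) are illustrative choices.

```python
#!/usr/bin/env python3
"""Report the GC content of DNA strings given on the command line."""
import sys


def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def main(arguments):
    for sequence in arguments:
        print(f"{sequence}\t{gc_content(sequence):.3f}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

After saving the file as gc_content.py and granting execute permission with chmod +x gc_content.py, it can be run as ./gc_content.py ACGT, just like a shell script.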
Marcel Turcotte
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa