Bioinformatics Workshop
Day 1: Raw Data to Aligned Reads
2 days · 4 hours each · Hands-on, tutorial-oriented
Linux CLI
Sequence Formats
Quality Control
Alignment
Welcome! Today we cover the first half of a genomics pipeline — from raw FASTQ files
to aligned BAM files. By the end of today you will have run (in simulation) every step
that a clinical genomics lab runs on patient samples. Questions welcome throughout.
Goals for Today
Navigate the Linux command line confidently
Understand what's inside FASTA, FASTQ, SAM, and VCF files
Assess read quality and apply trimming
Understand Smith-Waterman and how BWA uses it
Format: Each module = 10 min lecture + 30 min notebook exercises.
Run cells as we go — errors are expected and educational.
We learn by doing. Don't worry about memorizing commands — focus on understanding
why each step exists. The Jupyter notebooks are yours to keep.
The Bioinformatics Pipeline
Patient / Sample
↓
DNA/RNA extraction
↓
Library preparation
↓
Sequencing → FASTQ files ← YOU ARE HERE
↓
Quality control + trimming Module 3
↓
Alignment to reference Module 4
↓
Variant calling / expression quantification Day 2
↓
Biological interpretation
This is the standard genomics pipeline used by every major sequencing center in the world.
The UK Biobank, gnomAD, TCGA — all used variants of exactly this workflow.
Why This Matters
Scale
1000 Genomes: 2,500 genomes, 84 TB raw data.
UK Biobank: 500,000 participants.
Medicine
Every hospital will sequence patients routinely by 2030.
FDA-approved genomic tests: 75+ and rising.
Jobs
Bioinformatics: fastest-growing field in biology.
Median salary: $95K–$140K.
This is not niche — it is foundational to modern medicine.
Module 1
Linux CLI for Bioinformatics
45 minutes · Notebook: 01_linux_cli.ipynb
Every major bioinformatics tool — BWA, GATK, SAMtools, Trimmomatic — is command-line only.
There is no point-and-click genomics at scale.
Why the Terminal?
All major tools: bwa, samtools, gatk, trimmomatic — CLI only
Pipelines process hundreds of samples — no GUI can do that
Reproducibility: your script is your methods section
SSH into HPC clusters: only a terminal is available
Even if you primarily write Python or R, you need the CLI for running tools,
checking outputs, and moving data around on a cluster.
Essential Commands
Navigation
pwd # where am I?
ls -lh # list with sizes
cd data/ # change directory
mkdir -p results/qc
find . -name "*.fastq"
Inspect files
head -n 8 sample.fastq
wc -l sample.fastq # count lines
file sample.fastq.gz # identify type
Pipes
# Count reads in FASTQ
grep -c '^@' sample.fastq
# Chromosome counts in BED
cut -f1 genes.bed | sort | uniq -c | sort -rn
# Process gzip without decompressing
zcat sample.fastq.gz | head -n 8
The pipe | is the most powerful concept in Unix. It chains commands so that
the output of one becomes the input of the next. This is how you process
gigabytes of data without loading it all into memory.
Shell Scripts = Reproducible Pipelines
#!/usr/bin/env bash
set -euo pipefail # fail on errors, undefined vars, pipe failures
SAMPLES=(SRR001 SRR002 SRR003)
REF="reference/chr22.fa"
THREADS=8
mkdir -p results/alignment
for SAMPLE in "${SAMPLES[@]}"; do
echo "[$(date)] Aligning ${SAMPLE}..."
bwa mem -t ${THREADS} ${REF} \
data/${SAMPLE}.fastq.gz \
| samtools sort -o results/alignment/${SAMPLE}.bam
samtools index results/alignment/${SAMPLE}.bam
echo "[$(date)] Done: ${SAMPLE}"
done
This script is your methods section. Version control it in git.
set -euo pipefail is the single most important line in any bash script.
Without it, errors are silently ignored and you get wrong results.
-e: exit on error. -u: error on undefined variable. -o pipefail: fail if any command in a pipe fails.
Python + Shell: The Best of Both
import subprocess
from pathlib import Path
def run(cmd):
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
return result.stdout
# Count reads in a FASTQ file
n_reads = int(run("grep -c '^@' sample.fastq").strip())
print(f"Reads: {n_reads}")
# Build output paths safely
def output_bam(fastq: Path, outdir: Path) -> Path:
stem = fastq.name.replace('.fastq.gz', '').replace('.fastq', '')
return outdir / f"{stem}.bam"
print(output_bam(Path("SRR001.fastq.gz"), Path("results")))
Use Python to orchestrate and analyze, bash for running bioinformatics tools.
pathlib.Path is safer than string concatenation — it handles slashes correctly
on all operating systems.
Module 2
Sequence Data Formats
45 minutes · Notebook: 02_sequence_formats.ipynb
Format Landscape
Format Contains Typical size Tool
FASTA Sequence only Reference: 3 GB BLASTn, BWA index
FASTQ Sequence + quality 1–20 GB/sample Trimmomatic, BWA
SAM/BAM Aligned reads 5–30 GB SAMtools, GATK
VCF Variants 100 MB–10 GB GATK, bcftools
These 4 formats are the lingua franca of genomics. If you know these 4 formats
deeply, you can work with any genomics pipeline.
FASTQ Format
@read_001 chr22:sim length=150 ← header (@ prefix)
ACGTACGTACGTACGTACGTACGTACGT... ← sequence (150 bases)
+ ← separator
IIIIHHGG::IIIIIIIIIIIIIIIIII... ← quality (ASCII Phred+33)
Phred Quality Score
Q = −10 log₁₀(Perror )
Score Error rate Accuracy ASCII char
Q10 1 in 10 90% '+' (43)
Q20 1 in 100 99% '5' (53)
Q30 1 in 1,000 99.9% '?' (63)
Q40 1 in 10,000 99.99% 'I' (73)
Q30 is the standard quality threshold. Most downstream analyses filter out bases with Q < 20.
The ASCII encoding is Phred + 33, so Q30 = ASCII 63 = '?'.
This is why quality strings look like random punctuation.
SAM Format
# Header
@HD VN:1.6 SO:coordinate
@SQ SN:chr22 LN:50818468
# Alignment record
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL TAGS
read001 0 chr22 17000001 60 75M2I73M * 0 0 ACG III NM:i:2
CIGAR string: 75M2I73M
75 bp match · 2 bp insertion in query · 73 bp match
M=match, I=insertion, D=deletion, S=soft clip, N=intron skip
The FLAG field is a bitmask. FLAG=16 means reverse strand. FLAG=4 means unmapped.
The CIGAR string tells you exactly how the read aligns — mismatches, insertions,
deletions, and clipped regions.
VCF Format
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr22 17000100 rs12345 A G 856.3 PASS DP=45;AF=0.52 GT:DP 0/1:45
GT field
0/1 = heterozygous (one ref, one alt)
1/1 = homozygous alt
0/0 = homozygous ref
Key INFO fields
DP = total depth
AF = allele frequency
MQ = mapping quality
VCF is the output of variant callers like GATK. One VCF file can contain
millions of variants across hundreds of samples. The INFO and FORMAT columns
can contain dozens of fields — GATK alone outputs 30+ annotation fields.
Module 3
Quality Control + Read Trimming
60 minutes · Notebook: 03_quality_control.ipynb
Why QC?
What goes wrong
Quality degrades at 3′ end of reads
Adapter contamination (library prep artifact)
GC bias (PCR amplification skew)
Optical duplicates
Low-complexity sequences (polyA, polyG)
Consequences if ignored
Misalignments → false variant calls
Adapter reads → alignment failures
Low-quality bases → noise in pileup
Biased counts → wrong DE results
Garbage in, garbage out. QC is not optional. It is the difference between
a paper that gets rejected and one that gets published in Nature.
FastQC runs in minutes and can save weeks of downstream troubleshooting.
FastQC Dashboard
Per-base quality
Quality vs position in read.
Green: Q>28, Yellow: Q20-28, Red: Q<20
Adapter content
% reads containing adapter at each position.
>5% = definite contamination
GC distribution
Should match reference GC%.
Human genome: ~41% GC
# Run FastQC (real tool)
fastqc sample.fastq.gz -o results/qc/
# Or multiqc to aggregate many samples
multiqc results/qc/ -o results/multiqc/
FastQC generates an HTML report with these plots automatically.
MultiQC aggregates FastQC reports from hundreds of samples into one HTML.
In our notebook we implement these plots from scratch to understand them.
Trimmomatic: The Gold Standard Trimmer
trimmomatic PE \
sample_R1.fastq.gz sample_R2.fastq.gz \
sample_R1_paired.fastq.gz sample_R1_unpaired.fastq.gz \
sample_R2_paired.fastq.gz sample_R2_unpaired.fastq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \ # adapter sequences
LEADING:3 \ # trim 5' if Q < 3
TRAILING:3 \ # trim 3' if Q < 3
SLIDINGWINDOW:4:20 \ # 4-base window, min mean Q=20
MINLEN:36 # discard reads < 36 bp
SLIDINGWINDOW:4:20 — scan from 5′ end;
trim from the first position where the mean quality in a 4-base window drops below 20.
In Module 3 we implement SLIDINGWINDOW in pure Python.
Understanding the algorithm helps you choose the right parameters.
Common mistake: setting min_qual too high (>30) can discard 30-40% of reads.
Before vs After Trimming
Before
150 bp reads
Quality drops at position 120+
15% contain adapter
Mean Q = 29.3
After
Mean length 128 bp
Quality uniform to read end
<0.1% adapter
Mean Q = 34.1
2% reads discarded (too short)
Rule of thumb: if >10% of reads are discarded, investigate the cause.
A good trimming run keeps 95-98% of reads. If you're discarding more,
your quality thresholds are too strict, or the library quality was genuinely poor.
Always QC both before AND after trimming.
Module 4
Read Alignment
90 minutes · Notebook: 04_alignment.ipynb
The Alignment Problem
30 million 150bp puzzle pieces → put them back into a 3 billion bp genome
Reference: ...ACGTACGTGCATGCATGC ATCGATCG...
Read: GCATGCATGC
Position: chr22:17,000,100
Challenges
Mismatches (SNPs), indels, repeat regions, splicing (RNA), structural variants
BWA-MEM2 approach
1. Seed (exact match using BWT index)
2. Extend (Smith-Waterman)
3. Assign mapping quality
The reference genome has 3 billion base pairs. Aligning 30 million reads to it
sounds like it should take forever. BWA does it in about 2 hours on a 16-core machine.
The key: the Burrows-Wheeler Transform index lets BWA find exact seed matches in O(log n) time.
Smith-Waterman: Local Alignment
def smith_waterman(query, ref, match=2, mismatch=-1, gap=-2):
m, n = len(query), len(ref)
H = np.zeros((m+1, n+1)) # scoring matrix
for i in range(1, m+1):
for j in range(1, n+1):
s = match if query[i-1] == ref[j-1] else mismatch
H[i,j] = max(
0, # local: allow restart
H[i-1, j-1] + s, # diagonal (match/mismatch)
H[i-1, j] + gap, # deletion in query
H[i, j-1] + gap, # insertion in query
)
best_pos = np.unravel_index(H.argmax(), H.shape)
return H, H[best_pos], best_pos
Smith-Waterman is O(mn) time and space — too slow for 30M reads × 3B ref.
BWA uses the BWT index to find seeds (exact matches), then only runs SW
in a small window around each seed. This makes it millions of times faster.
Smith-Waterman: Example
Query: ACGTACGT
Ref: TTTACGTACGTGGG
Alignment:
Query: ACGTACGT
||||||||
Ref: ACGTACGT (at position 3 of ref)
Score: 16 (8 × match=2)
The DP matrix is visualized as a heatmap in the notebook — hot spots = good alignments.
In the notebook exercise you will visualize the DP matrix and trace back the alignment path.
This is the key intuition: the heatmap shows where the two sequences are similar,
and the traceback follows the path of highest scores back to the origin.
From Alignment to BAM
# Index the reference genome (one-time)
bwa index reference/chr22.fa
# Align reads (produces SAM)
bwa mem -t 8 reference/chr22.fa sample.fastq.gz \
| samtools sort -o results/sample.bam
# Index the BAM
samtools index results/sample.bam
# Quick stats
samtools flagstat results/sample.bam
Expected flagstat output:
29,123,456 + 0 mapped (97.24% : N/A)
28,101,234 + 0 properly paired (93.54% : N/A)
BWA takes ~2 hours for a 30x WGS sample on 16 cores.
Mapping rate > 90% is expected. Below 80% suggests contamination or wrong reference.
Proper pair rate > 90% is expected. Below 85% suggests library prep problems.
Coverage: The Key Output Metric
Depth Application
5–10x Population studies (cheap)
30x Gold standard WGS
100x Cancer (somatic variants)
500x+ cfDNA (liquid biopsy)
Coverage uniformity
% genome covered at ≥ 10x should be > 95%.
Repeat regions, centromeres, and GC-extreme regions are always under-covered.
30x coverage means each position in the genome has been sequenced ~30 times on average.
This redundancy is not waste — it's how we distinguish true variants from sequencing errors.
At 30x, most positions have 20-50 reads, which is enough for confident genotyping.
Day 1 Summary
Module 1: CLI
pipe, grep, awk, cut
subprocess, pathlib
Module 2: Formats
FASTA, FASTQ (Phred)
SAM (CIGAR, FLAG)
VCF (FILTER, AF)
Modules 3-4: QC+Align
FastQC metrics
Trimmomatic SLIDINGWINDOW
Smith-Waterman, BWA
Tomorrow: BAM → variants → expression → visualization → capstone
Great work today. You have built the complete Day 1 pipeline from scratch in Python.
Tomorrow we take the aligned BAM file and turn it into biological insight:
variant calls, differential expression, and publication-quality figures.
See You Tomorrow
Day 2 starts at 9:00 AM
Tonight: review your notebook outputs.
Make sure your trimming pipeline produces reasonable before/after stats.
Check that your Smith-Waterman correctly traces back the alignment.
Questions? Office hours after the session or email the instructors.
Press S to exit speaker notes mode. Thank you!