Bioinformatics Workshop
Day 2: From Aligned Reads to Biology
Day 2 of 2 · 4 hours
Variant Calling
RNA-seq / DESeq2
Visualization
Capstone
Welcome back! Yesterday we built the pipeline from raw FASTQ to aligned BAM.
Today we answer the biological questions: what variants does this genome carry?
Which genes are differentially expressed? And how do we communicate these findings?
Day 1 Recap
✓ Linux CLI
pipes, grep, awk, loops, subprocess, pathlib
✓ Formats
FASTA, FASTQ (Phred+33), SAM (CIGAR, FLAG), VCF
✓ QC + Align
FastQC metrics, SLIDINGWINDOW trimming, Smith-Waterman, BWA-MEM2
Output: results/sample.bam — sorted, indexed, quality-filtered.
Coverage: 30x mean, 96% genome at ≥ 10x.
You have a clean, aligned BAM file. Now we turn it into biology.
The biological questions: What genetic variants does this patient carry?
Which genes are turned on or off in treated vs control cells?
The Biological Questions
Genomic DNA analysis
"Does this patient carry a disease-causing variant?"
BAM → Variant calling (GATK) → VCF → annotation → clinical report
RNA-seq analysis
"Which genes are turned on/off in cancer vs normal?"
FASTQ → align → count → DESeq2 → DE gene list → pathway analysis
Today's capstone combines both: QC → align → variant call → DE analysis
Both analyses start from the same raw data but ask fundamentally different questions.
DNA analysis looks at permanent genetic differences (variants).
RNA analysis looks at dynamic expression changes (which genes are active).
Module 5
BAM Processing + Variant Calling
70 minutes · Notebook: 05_variant_calling.ipynb
What is a Variant?
SNV
Single nucleotide variant
A→G, C→T, etc.
Most common: 4–5M per genome
Indel
Insertion or deletion
AATG→A or A→AATG
~500K per genome
Structural
Large rearrangements, CNV, translocation
>50 bp; require special callers
Any position where your sample differs from the reference genome (GRCh38).
The human reference genome is a consensus — no individual actually has exactly this sequence.
Every person has ~4-5 million SNVs, ~500K indels, and hundreds of structural variants
relative to GRCh38.
GATK HaplotypeCaller
# Step 1: Mark duplicates
gatk MarkDuplicates \
-I results/sample.bam \
-O results/sample_markdup.bam \
-M results/sample_markdup_metrics.txt
# Step 2: Base Quality Score Recalibration (BQSR)
gatk BaseRecalibrator -I sample_markdup.bam \
-R reference/chr22.fa --known-sites dbsnp_146.vcf.gz
# Step 3: Variant calling
gatk HaplotypeCaller \
-R reference/chr22.fa \
-I results/sample_markdup_recal.bam \
-O results/sample_raw.vcf.gz \
--emit-ref-confidence GVCF
GATK HaplotypeCaller works by:
1. Finding active regions — areas with evidence of variation
2. Reassembling reads in that region using a de Bruijn graph
3. Realigning reassembled haplotypes to the reference using pair-HMM
4. Calling genotypes using a probabilistic model
This local reassembly is why GATK is much more accurate than pileup-based callers.
VCF Quality Filtering
# Raw GATK output includes both true variants and artifacts
Total variants: 4,823,104
# Hard filters (fast, good for single samples)
FILTER=PASS: 4,312,456 (89.4%)
QUAL >= 30: 4,198,901 (87.1%)
DP >= 10: 4,289,203 (88.9%)
All filters: 3,987,234 (82.7%)
Ti/Tv ratio — the quality sanity check
All variants: Ti/Tv = 1.83 ← suspicious
After filtering: Ti/Tv = 2.07 ← expected for WGS
Transitions (A↔G, C↔T) occur more frequently than transversions in the human genome
due to the chemistry of CpG methylation deamination.
The expected Ti/Tv for WGS is ~2.0, for exome ~2.5-3.0.
If your Ti/Tv is below 1.5 after filtering, you have many false positives.
VQSR (Variant Quality Score Recalibration) is better for cohort analysis.
Allele Frequency Spectrum
import matplotlib.pyplot as plt
afs = [float(v['af']) for v in passing_variants]
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(afs, bins=30, color='steelblue')
ax.axvline(0.5, color='orange',
linestyle='--', label='AF=0.5 (het)')
ax.set_xlabel('Allele Frequency')
ax.set_title('AF Distribution')
plt.show()
What to expect
Germline: bimodal — heterozygous ~0.5, homozygous ~1.0
Somatic/tumor: uniform distribution (clonal evolution)
Contamination: peak near 0.5 is too flat
The allele frequency spectrum is highly informative about the biology.
For germline calls you expect a clear bimodal distribution: heterozygous variants cluster
near AF=0.5, homozygous variants near AF=1.0.
For tumor samples, clonal evolution creates many variants at various AFs.
Variant Annotation
Functional impact
Synonymous: same amino acid (silent)
Missense: different amino acid
Nonsense: premature stop codon
Frameshift: shifts reading frame
Splice site: disrupts splicing
Annotation tools
VEP (Ensembl Variant Effect Predictor)
ANNOVAR
SnpEff
Cross-reference: ClinVar, gnomAD, COSMIC, OMIM
Finding variants is only the beginning. The hard question is: which of the 4 million
variants in this genome are clinically relevant?
gnomAD contains allele frequencies for 730,000+ exomes — if a variant is common
in the population (AF>1%), it is unlikely to be the cause of a rare disease.
Module 6
RNA-seq and DESeq2
70 minutes · Notebook: 06_rnaseq_deseq2.ipynb
DNA vs RNA: What Are We Measuring?
DNA sequencing (WGS)
What could happen
Permanent sequence variants
Same in every cell of the body
Doesn't change over time
RNA sequencing
What is happening right now
Gene expression levels
Varies by cell type, condition, time
Responds to treatment
RNA = the phenotype. DNA = the genotype. Both matter.
This is the central distinction. Two cancer cells from the same patient can have
very different gene expression even if they have the same genome.
A drug treatment changes gene expression dramatically without changing the genome.
RNA-seq captures that dynamic regulatory layer.
RNA-seq Pipeline
FASTQ (RNA reads)
↓
Splice-aware alignment: HISAT2 / STAR
↓
Count reads per gene: featureCounts / STAR --quantMode
↓
Count matrix (genes × samples)
↓
DESeq2 / edgeR: differential expression
↓
DE gene list → pathway analysis → biology
The key difference from DNA alignment: RNA comes from spliced mRNA.
HISAT2 and STAR can handle reads that span intron-exon boundaries (N in CIGAR).
featureCounts counts how many reads align to each annotated gene.
DESeq2 then does the statistical testing.
DESeq2: Median-of-Ratios Normalization
Problem: Sample A has 10M reads, Sample B has 15M reads.
Gene X: 1000 counts in A, 1200 counts in B.
Is gene X really higher in B? No — it's just a bigger library.
import numpy as np
def size_factors(counts):
# 1. Geometric mean per gene
log_c = np.log(counts[counts.min(axis=1) > 0])
log_gm = log_c.mean(axis=1, keepdims=True)
# 2. Log-ratios of each sample to reference
ratios = log_c - log_gm
# 3. Size factor = exp(median ratio)
return np.exp(np.median(ratios, axis=0))
Why median?
The median is robust to highly expressed genes.
A few genes with 100,000 counts would dominate the mean,
but barely move the median.
Simple CPM normalization (counts per million) divides by total library size.
But if one gene accounts for 30% of the library, CPM over-corrects.
DESeq2's median-of-ratios is robust: compute the ratio of each gene to its
geometric mean across samples, then take the MEDIAN of those ratios.
The median is robust to outliers like highly expressed genes.
Negative Binomial Model
Why not a t-test?
RNA-seq counts are:
Non-negative integers (0, 1, 2, ...)
Right-skewed (most genes: low; few genes: very high)
Variance > mean (overdispersion)
Normal distribution doesn't fit this.
Negative Binomial
Var(X) = μ + α·μ²
α = per-gene dispersion
DESeq2 estimates α from the data using an empirical Bayes "shrinkage" approach
RNA-seq counts are overdispersed: the variance is much higher than the mean.
For a Poisson distribution, variance = mean. For RNA-seq, variance = mean + dispersion * mean^2.
DESeq2 estimates the dispersion for each gene using a shrinkage approach that borrows
strength from genes with similar expression levels.
Volcano Plot: Visualizing DE Results
fig, ax = plt.subplots(figsize=(8, 7))
# Classify genes
up = (log2fc >= 1) & (padj < 0.05) # upregulated
down = (log2fc <= -1) & (padj < 0.05) # downregulated
# Plot
ax.scatter(log2fc[~(up|down)], -np.log10(padj[~(up|down)]),
c='lightgray', alpha=0.5, s=20, label='NS')
ax.scatter(log2fc[up], -np.log10(padj[up]),
c='#e74c3c', alpha=0.8, s=40, label=f'Up ({up.sum()})')
ax.scatter(log2fc[down], -np.log10(padj[down]),
c='#3498db', alpha=0.8, s=40, label=f'Down ({down.sum()})')
# Threshold lines
ax.axhline(-np.log10(0.05), color='gray', linestyle='--')
ax.axvline(1, color='gray', linestyle=':')
ax.axvline(-1, color='gray', linestyle=':')
The volcano plot is the standard way to present DE results.
X-axis: log2 fold change (effect size).
Y-axis: -log10(adjusted p-value) (statistical significance).
Top-right: upregulated, significantly.
Top-left: downregulated, significantly.
Flat bottom: not significant.
ALWAYS use adjusted p-values (Benjamini-Hochberg FDR), not raw p-values.
With 20,000 genes, you expect 1,000 false positives at p < 0.05 by chance.
Interpreting DE Results
What counts as DE?
padj < 0.05 (FDR controlled)
|log2FC| ≥ 1 (≥2-fold change)
Both thresholds matter!
Never filter on raw p-value alone.
PCA check
Before DE testing:
PC1 should separate conditions
Replicates should cluster tightly
Outliers → investigate or remove
Batch effects → ComBat / limma
The log2FC threshold of 1 is conventional, not sacred.
For some applications (drug discovery) you might require log2FC >= 2.
For others (biomarker discovery) you might accept log2FC >= 0.5.
Always check PCA BEFORE running DE. Outlier samples can dominate the results.
Module 7
Visualization
50 minutes · Notebook: 07_visualization.ipynb
Why Visualization Matters
"A result no one can understand is not a result."
Coverage plot
See reads pile up at each genomic position. Find CNVs, deletions, low-coverage regions.
Volcano + Heatmap
Communicate DE results clearly. Which genes? How much? In which direction?
PCA + Pathway
Big-picture: sample clustering, biological processes enriched.
Every figure in every genomics paper uses these exact plot types.
The difference between a paper that gets published in Nature and one that gets rejected
is often the quality and clarity of the figures.
Coverage Plots
fig, axes = plt.subplots(3, 1, figsize=(14, 9), sharex=True)
# Coverage depth
axes[0].fill_between(positions, coverage, color='steelblue', alpha=0.7)
axes[0].set_title('Coverage Depth')
# Log2 ratio (tumor/normal — detects copy number changes)
log2_ratio = np.log2((cov_tumor + 1) / (cov_normal + 1))
axes[1].bar(positions, log2_ratio,
color=np.where(log2_ratio > 0, '#e74c3c', '#3498db'))
axes[1].axhline(0, color='black', linewidth=0.8)
axes[1].set_title('log2(tumor / normal) — CNV track')
# Gene annotation track
for exon_start, exon_end in exons:
axes[2].fill_between([exon_start, exon_end], [0], [1], color='steelblue')
axes[2].set_title('Gene model')
Coverage plots are what you see in IGV (Integrative Genomics Viewer).
The log2 ratio track is the foundation of copy number variant calling.
A log2 ratio of +1 means 4 copies (2x amplification), -1 means 1 copy (deletion).
Expression Heatmap
import numpy as np
# Z-score normalize: (x - mean) / std per gene
z = (matrix - matrix.mean(axis=1, keepdims=True)) \
/ matrix.std(axis=1, keepdims=True)
fig, ax = plt.subplots(figsize=(9, 10))
im = ax.imshow(z, cmap='RdBu_r',
aspect='auto',
vmin=-2.5, vmax=2.5)
plt.colorbar(im, label='Z-score')
# Separate conditions with a line
ax.axvline(n_ctrl - 0.5,
color='black', linewidth=2)
Design choices
Z-score — normalizes per gene so all are on the same scale
RdBu_r — red=high, blue=low (common convention)
Clustering — group genes with similar patterns
Color blind — consider viridis or adding patterns
Z-score normalization is essential for heatmaps.
Without it, a gene with 10,000 counts will dominate the color scale
and you won't see any variation in genes with 100 counts.
Z-score puts every gene on a scale of standard deviations from its own mean.
PCA: Your First QC Plot
import numpy as np
# PCA on log2-normalized counts
X = log_cpm.T # samples as rows
X_c = X - X.mean(axis=0)
# SVD for PCA
U, S, Vt = np.linalg.svd(X_c,
full_matrices=False)
pcs = U * S # sample scores
var = (S**2) / (S**2).sum() * 100
ax.scatter(pcs[:, 0], pcs[:, 1])
ax.set_xlabel(f'PC1 ({var[0]:.1f}%)')
ax.set_ylabel(f'PC2 ({var[1]:.1f}%)')
What to look for
Conditions separate on PC1
Replicates cluster tightly
Outlier sample → investigate
No separation → no DE signal
Unexpected clustering → batch effect
PCA should always be your FIRST plot when you get an RNA-seq dataset.
If your treated and control samples don't separate on PC1, there is no DE signal
to find, no matter what statistical test you run.
Unexpected clustering often reveals batch effects: samples processed on different
days, by different people, or in different labs.
Pathway Enrichment
Over-representation analysis (ORA)
Given DE genes, which GO terms / KEGG pathways are enriched beyond chance?
Hypergeometric test:
P(k or more DE genes in pathway of size m, from universe of N genes with K DE)
GSEA
Gene Set Enrichment Analysis: use ALL genes ranked by fold change, not just those passing a threshold.
More sensitive than ORA for subtle but coordinated gene expression changes.
Tools: clusterProfiler (R), gProfiler2, DAVID, g:Profiler
Pathway analysis turns a list of gene names into a biological story.
Instead of saying "BRCA1 is DE", you can say "DNA repair pathway is activated".
GSEA is generally preferred because ORA requires an arbitrary cutoff for "DE".
Module 8
Capstone Challenge
50 minutes · Notebook: 08_capstone_exercises.ipynb
The Challenge
Challenge A
QC + Trim
Load 1000 simulated reads.
QC them. Find adapters. Trim.
Report before/after stats.
Challenge B
Variants
Parse a 50-variant VCF.
Apply quality filters.
Compute Ti/Tv. Plot the landscape.
Challenge C
DESeq2
Normalize 100 genes × 6 samples.
Run DE testing.
Volcano plot + heatmap.
Data: data/capstone/ — three synthetic files, realistic properties
Time: 40 minutes independent work + 10 minutes group discussion
Everything you need is in the earlier modules. The functions are built.
This is about integration: using them together, in the right order,
interpreting the outputs, and communicating the findings.
Solutions are in 08_capstone_solutions.ipynb — only look after trying hard.
Capstone Data
File Content Key features
synthetic_reads.fastq
1000 reads, 150 bp
Quality degradation, ~10% Illumina adapter contamination
synthetic_variants.vcf
50 SNVs on chr22
Ti/Tv ≈ 2.0, QUAL 10–800, 4 filtered variants
synthetic_counts.tsv
100 genes × 6 samples
3 ctrl + 3 treated; 20 genes are truly DE (fold change 2–8x)
All generated with generate_capstone_data.py (seed=42) — fully reproducible.
The capstone data is designed to be realistic but tractable.
The 20 DE genes have log2FC ranging from about -3 to +3.
Your normalized analysis should recover most of them.
Don't worry if you don't get all 20 — sensitivity/specificity tradeoffs are real.
What to Submit
Fill in the final summary cell in
08_capstone_exercises.ipynb:
Metric Your result
Input reads / reads after trim ___
Adapter contamination % ___
High-confidence variants ___
Ti/Tv ratio (passing) ___
DE genes (padj < 0.05) ___
Most significantly upregulated gene ___
Don't worry about getting perfect numbers. The learning is in the process.
Debugging a broken pipeline tells you more than a pipeline that just works.
Workshop Recap
Day 1
Linux CLI, subprocess, pathlib
FASTA/FASTQ/SAM/VCF formats
FastQC metrics, Trimmomatic
Smith-Waterman, BWA-MEM2
Day 2
GATK variant calling, VCF filtering
DESeq2 normalization + NB testing
Volcano, heatmap, PCA, pathway plots
Full end-to-end capstone
In two days, you have covered the complete standard genomics analysis pipeline.
The tools change (new versions every year), but the concepts are stable.
You now have the foundation to read any genomics paper and understand their methods.
Where to Go From Here
Deepen
Bioinformatics Algorithms (Compeau & Pevzner)
GATK Best Practices documentation
DESeq2 vignette
Practice
SRA public datasets
ENCODE project
TCGA cancer data
gnomAD browser
Build
Nextflow / Snakemake pipelines
Containerize with Docker
Run on HPC / AWS / GCP
Contribute to Bioconductor
The field moves fast. The best way to keep up is to work on real data.
Find a dataset you're curious about on SRA and run it through the pipeline you built today.
Bioinformatics is learned by doing.
Thank You
You built a complete genomics pipeline from scratch.
Notebooks: Available in your workshop directory
Slides: Open day1_slides.html or day2_slides.html in any browser
Press S: View speaker notes
Questions, feedback, collaboration: contact the instructors
Thank you for your participation. The notebooks are yours — run them, modify them,
break them, fix them. That's how you learn.
Good luck with your bioinformatics journey.