Bioinformatics Workshop · Day 2

From Aligned Reads to Biology

Day 2 of 2 · 4 hours

Variant Calling RNA-seq / DESeq2 Visualization Capstone

Press S for speaker notes · F for fullscreen · → to begin

Day 1 Recap

✓ Linux CLI

pipes, grep, awk, loops, subprocess, pathlib

✓ Formats

FASTA, FASTQ (Phred+33), SAM (CIGAR, FLAG), VCF

✓ QC + Align

FastQC metrics, SLIDINGWINDOW trimming, Smith-Waterman, BWA-MEM2

Output: results/sample.bam — sorted, indexed, quality-filtered.
Coverage: 30x mean, 96% genome at ≥ 10x.

The Biological Questions

Genomic DNA analysis

"Does this patient carry a disease-causing variant?"

BAM → Variant calling (GATK) → VCF → annotation → clinical report

RNA-seq analysis

"Which genes are turned on/off in cancer vs normal?"

FASTQ → align → count → DESeq2 → DE gene list → pathway analysis

Today's capstone combines both: QC → align → variant call → DE analysis

Module 5Day 2 · 70 min

BAM Processing & Variant Calling

Aligned reads are just the raw material. Now we ask the real question: where does this genome differ from the reference?

Notebook · 05_variant_calling.ipynb GATK · Picard · SAMtools Sample chr22 BAM

The one idea

A variant is any position where your sample's bases disagree with the reference — the whole pipeline exists to call those positions confidently.

You'll walk out able to

Mark duplicates & recalibrate base quality (BQSR)
Call variants with GATK HaplotypeCaller
Filter calls and sanity-check Ti/Tv ≈ 2.0

What is a Variant?

SNV

Single nucleotide variant
A→G, C→T, etc.
Most common: 4–5M per genome

Indel

Insertion or deletion
AATG→A or A→AATG
~500K per genome

Structural

Large rearrangements, CNV, translocation
>50 bp; require special callers

Any position where your sample differs from the reference genome (GRCh38).

GATK HaplotypeCaller

# Step 1: Mark duplicates
gatk MarkDuplicates \
  -I results/sample.bam \
  -O results/sample_markdup.bam \
  -M results/sample_markdup_metrics.txt

# Step 2: Base Quality Score Recalibration (BQSR)
gatk BaseRecalibrator -I sample_markdup.bam \
  -R reference/chr22.fa --known-sites dbsnp_146.vcf.gz

# Step 3: Variant calling
gatk HaplotypeCaller \
  -R reference/chr22.fa \
  -I results/sample_markdup_recal.bam \
  -O results/sample_raw.vcf.gz \
  --emit-ref-confidence GVCF

VCF Quality Filtering

    # Raw GATK output includes both true variants and artifacts

    Total variants: 4,823,104

    # Hard filters (fast, good for single samples)

    FILTER=PASS:   4,312,456 (89.4%)

    QUAL >= 30:    4,198,901 (87.1%)

    DP >= 10:      4,289,203 (88.9%)

    All filters:   3,987,234 (82.7%)

Ti/Tv ratio — the quality sanity check

All variants: Ti/Tv = 1.83 ← suspicious
After filtering: Ti/Tv = 2.07 ← expected for WGS

Allele Frequency Spectrum

import matplotlib.pyplot as plt

afs = [float(v['af']) for v in passing_variants]

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(afs, bins=30, color='steelblue')
ax.axvline(0.5, color='orange',
           linestyle='--', label='AF=0.5 (het)')
ax.set_xlabel('Allele Frequency')
ax.set_title('AF Distribution')
plt.show()

What to expect

Germline: bimodal — heterozygous ~0.5, homozygous ~1.0

Somatic/tumor: uniform distribution (clonal evolution)

Contamination: peak near 0.5 is too flat

Variant Annotation

Functional impact

Synonymous: same amino acid (silent)
Missense: different amino acid
Nonsense: premature stop codon
Frameshift: shifts reading frame
Splice site: disrupts splicing

Annotation tools

VEP (Ensembl Variant Effect Predictor)
ANNOVAR
SnpEff

Cross-reference: ClinVar, gnomAD, COSMIC, OMIM

Module 6Day 2 · 70 min

RNA-seq & DESeq2

DNA tells you what a cell could do. RNA tells you what it's actually doing right now.

Notebook · 06_rnaseq_deseq2.ipynb HISAT2 / STAR · featureCounts · DESeq2 Counts → biology

The one idea

Differential expression is a statistics problem: which genes change more than noise predicts? DESeq2 models counts with a negative binomial to find out.

You'll walk out able to

Splice-aware align & quantify into a count matrix
Normalize with median-of-ratios (not just CPM)
Read a volcano plot & a PCA the way a reviewer does

DNA vs RNA: What Are We Measuring?

DNA sequencing (WGS)

What could happen
Permanent sequence variants
Same in every cell of the body
Doesn't change over time

RNA sequencing

What is happening right now
Gene expression levels
Varies by cell type, condition, time
Responds to treatment

RNA = the phenotype. DNA = the genotype. Both matter.

RNA-seq Pipeline

FASTQ (RNA reads)
↓
Splice-aware alignment: HISAT2 / STAR
↓
Count reads per gene: featureCounts / STAR --quantMode
↓
Count matrix (genes × samples)
↓
DESeq2 / edgeR: differential expression
↓
DE gene list → pathway analysis → biology

DESeq2: Median-of-Ratios Normalization

    Problem: Sample A has 10M reads, Sample B has 15M reads.

    Gene X: 1000 counts in A, 1200 counts in B.

    Is gene X really higher in B? No — it's just a bigger library.

import numpy as np

def size_factors(counts):
    # 1. Geometric mean per gene
    log_c  = np.log(counts[counts.min(axis=1) > 0])
    log_gm = log_c.mean(axis=1, keepdims=True)

    # 2. Log-ratios of each sample to reference
    ratios = log_c - log_gm

    # 3. Size factor = exp(median ratio)
    return np.exp(np.median(ratios, axis=0))

Why median?

The median is robust to highly expressed genes.
A few genes with 100,000 counts would dominate the mean, but barely move the median.

Negative Binomial Model

Why not a t-test?

RNA-seq counts are:

Non-negative integers (0, 1, 2, ...)
Right-skewed (most genes: low; few genes: very high)
Variance > mean (overdispersion)

Normal distribution doesn't fit this.

Negative Binomial

Var(X) = μ + α·μ²

α = per-gene dispersion

DESeq2 estimates α from the data using an empirical Bayes "shrinkage" approach

Volcano Plot: Visualizing DE Results

fig, ax = plt.subplots(figsize=(8, 7))

# Classify genes
up   = (log2fc >= 1) & (padj < 0.05)  # upregulated
down = (log2fc <= -1) & (padj < 0.05) # downregulated

# Plot
ax.scatter(log2fc[~(up|down)], -np.log10(padj[~(up|down)]),
           c='lightgray', alpha=0.5, s=20, label='NS')
ax.scatter(log2fc[up], -np.log10(padj[up]),
           c='#e74c3c', alpha=0.8, s=40, label=f'Up ({up.sum()})')
ax.scatter(log2fc[down], -np.log10(padj[down]),
           c='#3498db', alpha=0.8, s=40, label=f'Down ({down.sum()})')

# Threshold lines
ax.axhline(-np.log10(0.05), color='gray', linestyle='--')
ax.axvline(1, color='gray', linestyle=':')
ax.axvline(-1, color='gray', linestyle=':')

Interpreting DE Results

What counts as DE?

padj < 0.05 (FDR controlled)
|log2FC| ≥ 1 (≥2-fold change)
Both thresholds matter!

Never filter on raw p-value alone.

PCA check

Before DE testing:

PC1 should separate conditions
Replicates should cluster tightly
Outliers → investigate or remove
Batch effects → ComBat / limma

Module 7Day 2 · 50 min

Visualization & Interpretation

A result no one can understand is not a result. This module turns matrices into figures a reviewer will believe.

Notebook · 07_visualization.ipynb matplotlib · seaborn Publication-ready figures

The one idea

The right plot answers one question at a glance. Choose the figure by the question, not by what looks impressive.

You'll walk out able to

Build coverage plots, volcanoes & heatmaps
Use PCA to spot batch effects & outliers
Run pathway enrichment (ORA / GSEA) on a gene list

Why Visualization Matters

"A result no one can understand is not a result."

Coverage plot

See reads pile up at each genomic position. Find CNVs, deletions, low-coverage regions.

Volcano + Heatmap

Communicate DE results clearly. Which genes? How much? In which direction?

PCA + Pathway

Big-picture: sample clustering, biological processes enriched.

Coverage Plots

fig, axes = plt.subplots(3, 1, figsize=(14, 9), sharex=True)

# Coverage depth
axes[0].fill_between(positions, coverage, color='steelblue', alpha=0.7)
axes[0].set_title('Coverage Depth')

# Log2 ratio (tumor/normal — detects copy number changes)
log2_ratio = np.log2((cov_tumor + 1) / (cov_normal + 1))
axes[1].bar(positions, log2_ratio,
            color=np.where(log2_ratio > 0, '#e74c3c', '#3498db'))
axes[1].axhline(0, color='black', linewidth=0.8)
axes[1].set_title('log2(tumor / normal) — CNV track')

# Gene annotation track
for exon_start, exon_end in exons:
    axes[2].fill_between([exon_start, exon_end], [0], [1], color='steelblue')
axes[2].set_title('Gene model')

Expression Heatmap

import numpy as np

# Z-score normalize: (x - mean) / std per gene
z = (matrix - matrix.mean(axis=1, keepdims=True)) \
  / matrix.std(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(9, 10))
im = ax.imshow(z, cmap='RdBu_r',
               aspect='auto',
               vmin=-2.5, vmax=2.5)
plt.colorbar(im, label='Z-score')

# Separate conditions with a line
ax.axvline(n_ctrl - 0.5,
           color='black', linewidth=2)

Design choices

Z-score — normalizes per gene so all are on the same scale
RdBu_r — red=high, blue=low (common convention)
Clustering — group genes with similar patterns
Color blind — consider viridis or adding patterns

PCA: Your First QC Plot

import numpy as np

# PCA on log2-normalized counts
X = log_cpm.T  # samples as rows
X_c = X - X.mean(axis=0)

# SVD for PCA
U, S, Vt = np.linalg.svd(X_c,
                          full_matrices=False)
pcs = U * S  # sample scores

var = (S**2) / (S**2).sum() * 100

ax.scatter(pcs[:, 0], pcs[:, 1])
ax.set_xlabel(f'PC1 ({var[0]:.1f}%)')
ax.set_ylabel(f'PC2 ({var[1]:.1f}%)')

What to look for

Conditions separate on PC1
Replicates cluster tightly
Outlier sample → investigate
No separation → no DE signal
Unexpected clustering → batch effect

Pathway Enrichment

Over-representation analysis (ORA)

Given DE genes, which GO terms / KEGG pathways are enriched beyond chance?

Hypergeometric test:
P(k or more DE genes in pathway of size m, from universe of N genes with K DE)

GSEA

Gene Set Enrichment Analysis: use ALL genes ranked by fold change, not just those passing a threshold.

More sensitive than ORA for subtle but coordinated gene expression changes.

Tools: clusterProfiler (R), gProfiler2, DAVID, g:Profiler

Module 8Day 2 · 50 min

Capstone Challenge

No new tools. Just you, a dataset you've never seen, and every skill from the last two days.

Notebook · 08_capstone_exercises.ipynb Synthetic chr22 data QC → align → call → DE

The one idea

Real bioinformatics is integration — stitching QC, alignment, variant calling, and expression into one coherent answer about the biology.

You'll walk out able to

Run an end-to-end pipeline unassisted
Debug a broken step from the error alone
Defend your conclusions with the figures to back them

The Challenge

Challenge A

QC + Trim

Load 1000 simulated reads.
QC them. Find adapters. Trim.
Report before/after stats.

Challenge B

Variants

Parse a 50-variant VCF.
Apply quality filters.
Compute Ti/Tv. Plot the landscape.

Challenge C

DESeq2

Normalize 100 genes × 6 samples.
Run DE testing.
Volcano plot + heatmap.

Data: data/capstone/ — three synthetic files, realistic properties
Time: 40 minutes independent work + 10 minutes group discussion

Capstone Data

File	Content	Key features
`synthetic_reads.fastq`	1000 reads, 150 bp	Quality degradation, ~10% Illumina adapter contamination
`synthetic_variants.vcf`	50 SNVs on chr22	Ti/Tv ≈ 2.0, QUAL 10–800, 4 filtered variants
`synthetic_counts.tsv`	100 genes × 6 samples	3 ctrl + 3 treated; 20 genes are truly DE (fold change 2–8x)

All generated with generate_capstone_data.py (seed=42) — fully reproducible.

What to Submit

Fill in the final summary cell in 08_capstone_exercises.ipynb:

Metric	Your result
Input reads / reads after trim	___
Adapter contamination %	___
High-confidence variants	___
Ti/Tv ratio (passing)	___
DE genes (padj < 0.05)	___
Most significantly upregulated gene	___

Workshop Recap

Day 1

Linux CLI, subprocess, pathlib
FASTA/FASTQ/SAM/VCF formats
FastQC metrics, Trimmomatic
Smith-Waterman, BWA-MEM2

Day 2

GATK variant calling, VCF filtering
DESeq2 normalization + NB testing
Volcano, heatmap, PCA, pathway plots
Full end-to-end capstone

Where to Go From Here

Deepen

Bioinformatics Algorithms (Compeau & Pevzner)
GATK Best Practices documentation
DESeq2 vignette

Practice

SRA public datasets
ENCODE project
TCGA cancer data
gnomAD browser

Build

Nextflow / Snakemake pipelines
Containerize with Docker
Run on HPC / AWS / GCP
Contribute to Bioconductor

Thank You

You built a complete genomics pipeline from scratch.

Notebooks: Available in your workshop directory

Slides: Open day1_slides.html or day2_slides.html in any browser

Press S: View speaker notes

Questions, feedback, collaboration: contact the instructors