Bioinformatics Workshop

Day 2: From Aligned Reads to Biology

Day 2 of 2 · 4 hours

Variant Calling RNA-seq / DESeq2 Visualization Capstone

Day 1 Recap

✓ Linux CLI

pipes, grep, awk, loops, subprocess, pathlib

✓ Formats

FASTA, FASTQ (Phred+33), SAM (CIGAR, FLAG), VCF

✓ QC + Align

FastQC metrics, SLIDINGWINDOW trimming, Smith-Waterman, BWA-MEM2
Output: results/sample.bam — sorted, indexed, quality-filtered.
Coverage: 30x mean, 96% genome at ≥ 10x.

The Biological Questions

Genomic DNA analysis

"Does this patient carry a disease-causing variant?"

BAM → Variant calling (GATK) → VCF → annotation → clinical report

RNA-seq analysis

"Which genes are turned on/off in cancer vs normal?"

FASTQ → align → count → DESeq2 → DE gene list → pathway analysis

Today's capstone combines both: QC → align → variant call → DE analysis

Module 5

BAM Processing + Variant Calling

70 minutes · Notebook: 05_variant_calling.ipynb

What is a Variant?

SNV

Single nucleotide variant
A→G, C→T, etc.
Most common: 4–5M per genome

Indel

Insertion or deletion
AATG→A or A→AATG
~500K per genome

Structural

Large rearrangements, CNV, translocation
>50 bp; require special callers

Any position where your sample differs from the reference genome (GRCh38).

GATK HaplotypeCaller

# Step 1: Mark duplicates
gatk MarkDuplicates \
  -I results/sample.bam \
  -O results/sample_markdup.bam \
  -M results/sample_markdup_metrics.txt

# Step 2: Base Quality Score Recalibration (BQSR)
gatk BaseRecalibrator -I sample_markdup.bam \
  -R reference/chr22.fa --known-sites dbsnp_146.vcf.gz

# Step 3: Variant calling
gatk HaplotypeCaller \
  -R reference/chr22.fa \
  -I results/sample_markdup_recal.bam \
  -O results/sample_raw.vcf.gz \
  --emit-ref-confidence GVCF

VCF Quality Filtering

# Raw GATK output includes both true variants and artifacts
Total variants: 4,823,104

# Hard filters (fast, good for single samples)
FILTER=PASS: 4,312,456 (89.4%)
QUAL >= 30: 4,198,901 (87.1%)
DP >= 10: 4,289,203 (88.9%)
All filters: 3,987,234 (82.7%)

Ti/Tv ratio — the quality sanity check

All variants: Ti/Tv = 1.83 ← suspicious
After filtering: Ti/Tv = 2.07 ← expected for WGS

Allele Frequency Spectrum

import matplotlib.pyplot as plt

afs = [float(v['af']) for v in passing_variants]

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(afs, bins=30, color='steelblue')
ax.axvline(0.5, color='orange',
           linestyle='--', label='AF=0.5 (het)')
ax.set_xlabel('Allele Frequency')
ax.set_title('AF Distribution')
plt.show()

What to expect

Germline: bimodal — heterozygous ~0.5, homozygous ~1.0

Somatic/tumor: uniform distribution (clonal evolution)

Contamination: peak near 0.5 is too flat

Variant Annotation

Functional impact

  • Synonymous: same amino acid (silent)
  • Missense: different amino acid
  • Nonsense: premature stop codon
  • Frameshift: shifts reading frame
  • Splice site: disrupts splicing

Annotation tools

VEP (Ensembl Variant Effect Predictor)
ANNOVAR
SnpEff

Cross-reference: ClinVar, gnomAD, COSMIC, OMIM

Module 6

RNA-seq and DESeq2

70 minutes · Notebook: 06_rnaseq_deseq2.ipynb

DNA vs RNA: What Are We Measuring?

DNA sequencing (WGS)

  • What could happen
  • Permanent sequence variants
  • Same in every cell of the body
  • Doesn't change over time

RNA sequencing

  • What is happening right now
  • Gene expression levels
  • Varies by cell type, condition, time
  • Responds to treatment

RNA = the phenotype. DNA = the genotype. Both matter.

RNA-seq Pipeline

FASTQ (RNA reads)
Splice-aware alignment: HISAT2 / STAR
Count reads per gene: featureCounts / STAR --quantMode
Count matrix (genes × samples)
DESeq2 / edgeR: differential expression
DE gene list → pathway analysis → biology

DESeq2: Median-of-Ratios Normalization

Problem: Sample A has 10M reads, Sample B has 15M reads.
Gene X: 1000 counts in A, 1200 counts in B.
Is gene X really higher in B? No — it's just a bigger library.
import numpy as np

def size_factors(counts):
    # 1. Geometric mean per gene
    log_c  = np.log(counts[counts.min(axis=1) > 0])
    log_gm = log_c.mean(axis=1, keepdims=True)

    # 2. Log-ratios of each sample to reference
    ratios = log_c - log_gm

    # 3. Size factor = exp(median ratio)
    return np.exp(np.median(ratios, axis=0))

Why median?

The median is robust to highly expressed genes.
A few genes with 100,000 counts would dominate the mean, but barely move the median.

Negative Binomial Model

Why not a t-test?

RNA-seq counts are:
  • Non-negative integers (0, 1, 2, ...)
  • Right-skewed (most genes: low; few genes: very high)
  • Variance > mean (overdispersion)

Normal distribution doesn't fit this.

Negative Binomial

Var(X) = μ + α·μ²

α = per-gene dispersion

DESeq2 estimates α from the data using an empirical Bayes "shrinkage" approach

Volcano Plot: Visualizing DE Results

fig, ax = plt.subplots(figsize=(8, 7))

# Classify genes
up   = (log2fc >= 1) & (padj < 0.05)  # upregulated
down = (log2fc <= -1) & (padj < 0.05) # downregulated

# Plot
ax.scatter(log2fc[~(up|down)], -np.log10(padj[~(up|down)]),
           c='lightgray', alpha=0.5, s=20, label='NS')
ax.scatter(log2fc[up], -np.log10(padj[up]),
           c='#e74c3c', alpha=0.8, s=40, label=f'Up ({up.sum()})')
ax.scatter(log2fc[down], -np.log10(padj[down]),
           c='#3498db', alpha=0.8, s=40, label=f'Down ({down.sum()})')

# Threshold lines
ax.axhline(-np.log10(0.05), color='gray', linestyle='--')
ax.axvline(1, color='gray', linestyle=':')
ax.axvline(-1, color='gray', linestyle=':')

Interpreting DE Results

What counts as DE?

  • padj < 0.05 (FDR controlled)
  • |log2FC| ≥ 1 (≥2-fold change)
  • Both thresholds matter!

Never filter on raw p-value alone.

PCA check

Before DE testing:
  • PC1 should separate conditions
  • Replicates should cluster tightly
  • Outliers → investigate or remove
  • Batch effects → ComBat / limma

Module 7

Visualization

50 minutes · Notebook: 07_visualization.ipynb

Why Visualization Matters

"A result no one can understand is not a result."

Coverage plot

See reads pile up at each genomic position. Find CNVs, deletions, low-coverage regions.

Volcano + Heatmap

Communicate DE results clearly. Which genes? How much? In which direction?

PCA + Pathway

Big-picture: sample clustering, biological processes enriched.

Coverage Plots

fig, axes = plt.subplots(3, 1, figsize=(14, 9), sharex=True)

# Coverage depth
axes[0].fill_between(positions, coverage, color='steelblue', alpha=0.7)
axes[0].set_title('Coverage Depth')

# Log2 ratio (tumor/normal — detects copy number changes)
log2_ratio = np.log2((cov_tumor + 1) / (cov_normal + 1))
axes[1].bar(positions, log2_ratio,
            color=np.where(log2_ratio > 0, '#e74c3c', '#3498db'))
axes[1].axhline(0, color='black', linewidth=0.8)
axes[1].set_title('log2(tumor / normal) — CNV track')

# Gene annotation track
for exon_start, exon_end in exons:
    axes[2].fill_between([exon_start, exon_end], [0], [1], color='steelblue')
axes[2].set_title('Gene model')

Expression Heatmap

import numpy as np

# Z-score normalize: (x - mean) / std per gene
z = (matrix - matrix.mean(axis=1, keepdims=True)) \
  / matrix.std(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(9, 10))
im = ax.imshow(z, cmap='RdBu_r',
               aspect='auto',
               vmin=-2.5, vmax=2.5)
plt.colorbar(im, label='Z-score')

# Separate conditions with a line
ax.axvline(n_ctrl - 0.5,
           color='black', linewidth=2)

Design choices

  • Z-score — normalizes per gene so all are on the same scale
  • RdBu_r — red=high, blue=low (common convention)
  • Clustering — group genes with similar patterns
  • Color blind — consider viridis or adding patterns

PCA: Your First QC Plot

import numpy as np

# PCA on log2-normalized counts
X = log_cpm.T  # samples as rows
X_c = X - X.mean(axis=0)

# SVD for PCA
U, S, Vt = np.linalg.svd(X_c,
                          full_matrices=False)
pcs = U * S  # sample scores

var = (S**2) / (S**2).sum() * 100

ax.scatter(pcs[:, 0], pcs[:, 1])
ax.set_xlabel(f'PC1 ({var[0]:.1f}%)')
ax.set_ylabel(f'PC2 ({var[1]:.1f}%)')

What to look for

  • Conditions separate on PC1
  • Replicates cluster tightly
  • Outlier sample → investigate
  • No separation → no DE signal
  • Unexpected clustering → batch effect

Pathway Enrichment

Over-representation analysis (ORA)

Given DE genes, which GO terms / KEGG pathways are enriched beyond chance?

Hypergeometric test:
P(k or more DE genes in pathway of size m, from universe of N genes with K DE)

GSEA

Gene Set Enrichment Analysis: use ALL genes ranked by fold change, not just those passing a threshold.

More sensitive than ORA for subtle but coordinated gene expression changes.

Tools: clusterProfiler (R), gProfiler2, DAVID, g:Profiler

Module 8

Capstone Challenge

50 minutes · Notebook: 08_capstone_exercises.ipynb

The Challenge

Challenge A

QC + Trim

Load 1000 simulated reads.
QC them. Find adapters. Trim.
Report before/after stats.

Challenge B

Variants

Parse a 50-variant VCF.
Apply quality filters.
Compute Ti/Tv. Plot the landscape.

Challenge C

DESeq2

Normalize 100 genes × 6 samples.
Run DE testing.
Volcano plot + heatmap.
Data: data/capstone/ — three synthetic files, realistic properties
Time: 40 minutes independent work + 10 minutes group discussion

Capstone Data

FileContentKey features
synthetic_reads.fastq 1000 reads, 150 bp Quality degradation, ~10% Illumina adapter contamination
synthetic_variants.vcf 50 SNVs on chr22 Ti/Tv ≈ 2.0, QUAL 10–800, 4 filtered variants
synthetic_counts.tsv 100 genes × 6 samples 3 ctrl + 3 treated; 20 genes are truly DE (fold change 2–8x)

All generated with generate_capstone_data.py (seed=42) — fully reproducible.

What to Submit

Fill in the final summary cell in 08_capstone_exercises.ipynb:

MetricYour result
Input reads / reads after trim___
Adapter contamination %___
High-confidence variants___
Ti/Tv ratio (passing)___
DE genes (padj < 0.05)___
Most significantly upregulated gene___

Workshop Recap

Day 1

  • Linux CLI, subprocess, pathlib
  • FASTA/FASTQ/SAM/VCF formats
  • FastQC metrics, Trimmomatic
  • Smith-Waterman, BWA-MEM2

Day 2

  • GATK variant calling, VCF filtering
  • DESeq2 normalization + NB testing
  • Volcano, heatmap, PCA, pathway plots
  • Full end-to-end capstone

Where to Go From Here

Deepen

Bioinformatics Algorithms (Compeau & Pevzner)
GATK Best Practices documentation
DESeq2 vignette

Practice

SRA public datasets
ENCODE project
TCGA cancer data
gnomAD browser

Build

Nextflow / Snakemake pipelines
Containerize with Docker
Run on HPC / AWS / GCP
Contribute to Bioconductor

Thank You

You built a complete genomics pipeline from scratch.

Notebooks: Available in your workshop directory

Slides: Open day1_slides.html or day2_slides.html in any browser

Press S: View speaker notes

Questions, feedback, collaboration: contact the instructors