Bioinformatics Workshop · 2 days, 4 hours each

Raw Data to Aligned Reads

Day 1 of 2 · Hands-on & tutorial-oriented

Linux CLI · Sequence Formats · Quality Control · Alignment


Goals for Today

Navigate the Linux command line confidently
Understand what's inside FASTA, FASTQ, SAM, and VCF files
Assess read quality and apply trimming
Understand Smith-Waterman and how BWA uses it

Format: Each module = 10 min lecture + 30 min notebook exercises.
Run cells as we go — errors are expected and educational.

The Bioinformatics Pipeline

Patient / Sample
↓
DNA/RNA extraction
↓
Library preparation
↓
Sequencing → FASTQ files ← YOU ARE HERE
↓
Quality control + trimming  Module 3
↓
Alignment to reference  Module 4
↓
Variant calling / expression quantification  Day 2
↓
Biological interpretation

Why This Matters

Scale

1000 Genomes: 2,500 genomes, 84 TB raw data.
UK Biobank: 500,000 participants.

Medicine

Many hospitals may be sequencing patients routinely by 2030.
FDA-approved genomic tests: 75+ and rising.

Jobs

Bioinformatics: fastest-growing field in biology.
Median salary: $95K–$140K.

Module 1 · Day 1 · 45 min

Linux CLI for Bioinformatics

Every genomics tool worth using lives in the terminal. In 45 minutes you go from nervous to fluent.

Notebook · 01_linux_cli.ipynb · Live browser sandbox on the module page · No install required

The one idea

A pipe | streams gigabytes through small, sharp tools — never loading the whole file into memory.

You'll walk out able to

Count reads with grep, awk, cut
Write a reproducible for-loop pipeline
Glue shell to Python with subprocess

Why the Terminal?

All major tools: bwa, samtools, gatk, trimmomatic — CLI only
Pipelines process hundreds of samples — no GUI can do that
Reproducibility: your script is your methods section
SSH into HPC clusters: only a terminal is available

Essential Commands

Navigation

pwd           # where am I?
ls -lh        # list with sizes
cd data/      # change directory
mkdir -p results/qc
find . -name "*.fastq"

Inspect files

head -n 8 sample.fastq
wc -l sample.fastq    # count lines
file sample.fastq.gz  # identify type

Pipes

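# Count reads in FASTQ: 4 lines per record
# (grep -c '^@' overcounts, since a quality line can also start with '@')
awk 'END { print NR / 4 }' sample.fastq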

# Chromosome counts in BED
cut -f1 genes.bed | sort | uniq -c | sort -rn

# Peek inside a gzip file without writing a decompressed copy to disk
zcat sample.fastq.gz | head -n 8
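The streaming idea carries over to Python. The sketch below counts reads by counting lines and dividing by four, and shows why counting lines that start with '@' can overcount. The function name `count_fastq_reads` and the two-read `demo` string are made up for illustration, and the code assumes a well-formed, unwrapped FASTQ (exactly four lines per record).

```python
import gzip
import io

def count_fastq_reads(handle):
    """Count FASTQ records in a text stream, assuming exactly 4 lines per record."""
    n_lines = sum(1 for _ in handle)  # streams line by line; nothing is kept in memory
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4; file may be truncated")
    return n_lines // 4

# Hypothetical two-read example; the second quality line starts with '@'.
demo = "@read_001\nACGT\n+\nIIII\n@read_002\nACGT\n+\n@III\n"

at_lines = sum(line.startswith("@") for line in demo.splitlines())
n_reads = count_fastq_reads(io.StringIO(demo))
print(at_lines, n_reads)  # 3 2

# The same function reads gzipped data without writing a decompressed copy.
compressed = io.BytesIO(gzip.compress(demo.encode("ascii")))
with gzip.open(compressed, "rt") as fh:
    n_reads_gz = count_fastq_reads(fh)
print(n_reads_gz)  # 2
```

For a real file, `gzip.open("sample.fastq.gz", "rt")` would take the place of the in-memory buffer.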

Shell Scripts = Reproducible Pipelines

#!/usr/bin/env bash
set -euo pipefail   # fail on errors, undefined vars, pipe failures

SAMPLES=(SRR001 SRR002 SRR003)
REF="reference/chr22.fa"
THREADS=8

mkdir -p results/alignment

# Assumes "bwa index ${REF}" has already been run once
for SAMPLE in "${SAMPLES[@]}"; do
    echo "[$(date)] Aligning ${SAMPLE}..."
    bwa mem -t "${THREADS}" "${REF}" \
        "data/${SAMPLE}.fastq.gz" \
        | samtools sort -o "results/alignment/${SAMPLE}.bam" -
    samtools index "results/alignment/${SAMPLE}.bam"
    echo "[$(date)] Done: ${SAMPLE}"
done

This script is your methods section. Version control it in git.

Python + Shell: The Best of Both

import subprocess
from pathlib import Path

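def run(cmd):
    """Run a trusted shell command; return stdout, raise if it exits non-zero."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True)
    return result.stdout

# Count reads in a FASTQ file: 4 lines per record
# (grep -c '^@' would overcount, since quality lines can start with '@')
n_lines = int(run("wc -l < sample.fastq").strip())
n_reads = n_lines // 4
print(f"Reads: {n_reads}")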

# Build output paths safely
def output_bam(fastq: Path, outdir: Path) -> Path:
    stem = fastq.name.replace('.fastq.gz', '').replace('.fastq', '')
    return outdir / f"{stem}.bam"

print(output_bam(Path("SRR001.fastq.gz"), Path("results")))

Module 2 · Day 1 · 45 min

Sequence Data Formats

Four text formats carry all of genomics. Learn to read them by eye and nothing downstream is a mystery.

Notebook · 02_sequence_formats.ipynb · Runs fully in your browser

The one idea

FASTA, FASTQ, SAM/BAM, VCF are just plain text with strict rules — every tool is a translator between them.

You'll walk out able to

Decode a Phred quality string
Read a CIGAR & SAM FLAG
Pull genotypes from a VCF

Format Landscape

Format	Contains	Typical size	Tool
FASTA	Sequence only	Reference: 3 GB	BLASTn, BWA index
FASTQ	Sequence + quality	1–20 GB/sample	Trimmomatic, BWA
SAM/BAM	Aligned reads	5–30 GB	SAMtools, GATK
VCF	Variants	100 MB–10 GB	GATK, bcftools

FASTQ Format

@read_001 chr22:sim length=150     ← header (@ prefix)
ACGTACGTACGTACGTACGTACGTACGT...   ← sequence (150 bases)
+                                  ← separator
IIIIHHGG::IIIIIIIIIIIIIIIIII...   ← quality (ASCII Phred+33)
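The four-line layout above can be read with a small generator. This is a minimal sketch that assumes unwrapped records (one sequence line, one quality line); `FastqRecord` and `read_fastq` are illustrative names, not part of any library, and the demo read is shortened to 8 bases.

```python
import io
from typing import Iterator, NamedTuple, TextIO

class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str

def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a 4-line-per-record FASTQ stream, with basic checks."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line): stop
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # The read name is the first whitespace-delimited token after '@'.
        yield FastqRecord(header[1:].split()[0], sequence, quality)

demo = "@read_001 chr22:sim length=8\nACGTACGT\n+\nIIIIHHGG\n"
records = list(read_fastq(io.StringIO(demo)))
print(records[0].name, len(records[0].sequence))  # read_001 8
```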

Phred Quality Score

Q = −10 log₁₀(P_error)

Score	Error rate	Accuracy	ASCII char
Q10	1 in 10	90%	'+' (43)
Q20	1 in 100	99%	'5' (53)
Q30	1 in 1,000	99.9%	'?' (63)
Q40	1 in 10,000	99.99%	'I' (73)
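Decoding a quality string is a single subtraction per character. The sketch below assumes Phred+33 encoding, as on the FASTQ slide; `phred_scores` and `error_probability` are names chosen here for illustration.

```python
import math

def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Invert Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)

quality = "IIIIHHGG::"
scores = phred_scores(quality)
print(scores)  # [40, 40, 40, 40, 39, 39, 38, 38, 25, 25]

# Spot-check against the table: Q30 corresponds to 1 error in 1,000.
print(math.isclose(error_probability(30), 0.001))  # True
```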

SAM Format

# Header
@HD VN:1.6  SO:coordinate
@SQ SN:chr22 LN:50818468

# Alignment record
QNAME   FLAG RNAME  POS     MAPQ CIGAR    RNEXT PNEXT TLEN SEQ  QUAL  TAGS
read001  0   chr22  17000001  60  75M2I73M  *     0     0   ACG  III   NM:i:2

CIGAR string: 75M2I73M

75 bp aligned · 2 bp insertion relative to the reference · 73 bp aligned
M = alignment match (the base may match or mismatch), I = insertion, D = deletion, S = soft clip, N = skipped region (e.g. intron)
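A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below also sums how many bases each operation consumes on the query and on the reference, following the SAM specification's operation table; the helper names are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operations that consume bases of the read
REF_OPS = set("MDN=X")    # operations that consume bases of the reference

def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) pairs; reject anything unparseable."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"cannot parse CIGAR {cigar!r}")
    return ops

def cigar_lengths(cigar):
    """Return (bases consumed on the query, bases consumed on the reference)."""
    ops = parse_cigar(cigar)
    query_len = sum(n for n, op in ops if op in QUERY_OPS)
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    return query_len, ref_len

print(parse_cigar("75M2I73M"))    # [(75, 'M'), (2, 'I'), (73, 'M')]
print(cigar_lengths("75M2I73M"))  # (150, 148)
```

With POS = 17000001 and 148 reference bases consumed, the example read would end at reference position 17000148.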

VCF Format

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
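##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">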
#CHROM  POS       ID       REF ALT QUAL  FILTER INFO              FORMAT  SAMPLE1
chr22   17000100  rs12345  A   G   856.3 PASS   DP=45;AF=0.52     GT:DP   0/1:45

GT field

0/1 = heterozygous (one ref, one alt)
1/1 = homozygous alt
0/0 = homozygous ref

Key INFO fields

DP = total depth
AF = allele frequency
MQ = RMS mapping quality
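Pulling a genotype out of a VCF data line is mostly string splitting. This is a minimal sketch that assumes tab-separated columns and a single-sample file; `parse_vcf_record` and `describe_genotype` are hypothetical helpers, and real pipelines would more likely rely on bcftools or a VCF library.

```python
def parse_vcf_record(line, sample_names):
    """Split one VCF data line into fixed fields, INFO dict, and per-sample dicts."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True  # flag entries carry no value
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "filter": flt, "info": info_dict, "samples": samples}

def describe_genotype(gt):
    """Classify a diploid GT string such as 0/1, 1/1, or 0|0."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous ref" if alleles[0] == "0" else "homozygous alt"

line = "chr22\t17000100\trs12345\tA\tG\t856.3\tPASS\tDP=45;AF=0.52\tGT:DP\t0/1:45"
record = parse_vcf_record(line, ["SAMPLE1"])
gt = record["samples"]["SAMPLE1"]["GT"]
print(gt, describe_genotype(gt))  # 0/1 heterozygous
```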

Module 3 · Day 1 · 60 min

Quality Control + Read Trimming

Garbage in, garbage out. The cheapest minutes you'll ever spend are the ones that catch bad data early.

Notebook · 03_quality_control.ipynb · FastQC · MultiQC · Trimmomatic · fastp

The one idea

Sequencers make predictable mistakes — quality decay, adapters, bias. QC finds them; trimming fixes them.

You'll walk out able to

Read a FastQC report fluently
Implement a sliding-window trimmer
Judge before/after trim stats

Why QC?

What goes wrong

Quality degrades at 3′ end of reads
Adapter contamination (library prep artifact)
GC bias (PCR amplification skew)
Optical duplicates
Low-complexity sequences (polyA, polyG)

Consequences if ignored

Misalignments → false variant calls
Adapter reads → alignment failures
Low-quality bases → noise in pileup
Biased counts → wrong DE results

FastQC Dashboard

Per-base quality

Quality vs position in read.
Green: Q ≥ 28, Orange: Q20–28, Red: Q < 20

Adapter content

% reads containing adapter at each position.
>5% triggers a FastQC warning; >10% a failure

GC distribution

Should match reference GC%.
Human genome: ~41% GC

# Run FastQC (real tool)
fastqc sample.fastq.gz -o results/qc/

# Or multiqc to aggregate many samples
multiqc results/qc/ -o results/multiqc/
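The GC check can be approximated by hand on a few reads. The sketch below computes the GC fraction per read and ignores uncalled bases (N); the three reads are invented for illustration, and FastQC's own GC module fits a distribution rather than reporting a single mean.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called

reads = ["ACGTACGT", "GGGCCCAT", "ATATATNN"]  # hypothetical reads
per_read = [gc_fraction(read) for read in reads]
print(per_read)  # [0.5, 0.75, 0.0]

mean_gc = sum(per_read) / len(per_read)
print(f"mean GC: {mean_gc:.1%}")
```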

Trimmomatic: The Gold Standard Trimmer

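# ILLUMINACLIP: adapter file : seed mismatches : palindrome threshold : simple threshold
# LEADING / TRAILING: remove bases below Q3 from the 5' / 3' end
# SLIDINGWINDOW: 4-base window, minimum mean Q = 20
# MINLEN: discard reads shorter than 36 bp
# (comments are kept above the command: text after a trailing "\" breaks the line continuation)
trimmomatic PE \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1_paired.fastq.gz sample_R1_unpaired.fastq.gz \
  sample_R2_paired.fastq.gz sample_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 \
  TRAILING:3 \
  SLIDINGWINDOW:4:20 \
  MINLEN:36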

SLIDINGWINDOW:4:20 — scan from 5′ end; trim from the first position where the mean quality in a 4-base window drops below 20.
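The rule just described can be sketched in a few lines. This is a simplified illustration of the idea, cutting at the start of the first failing window, and is not claimed to reproduce Trimmomatic's exact behaviour at window boundaries; `sliding_window_trim` is a name chosen here, and Phred+33 encoding is assumed.

```python
def sliding_window_trim(sequence, quality, window=4, min_mean_q=20, offset=33):
    """Cut the read at the first window whose mean Phred score is below min_mean_q."""
    scores = [ord(ch) - offset for ch in quality]
    cut = len(scores)  # default: keep the whole read
    # Reads shorter than the window are returned untrimmed in this sketch.
    for start in range(max(len(scores) - window + 1, 0)):
        if sum(scores[start:start + window]) / window < min_mean_q:
            cut = start
            break
    return sequence[:cut], quality[:cut]

# Hypothetical read: eight Q40 bases ('I') followed by four Q2 bases ('#').
trimmed = sliding_window_trim("ACGTACGTACGT", "IIIIIIII####")
print(trimmed)  # ('ACGTACG', 'IIIIIII')
```

In this example the windows starting at positions 5 and 6 still average 30.5 and 21, so the first failing window starts at position 7 (mean 11.5) and seven bases are kept. A length filter in the spirit of MINLEN would then drop any read whose trimmed length falls below the chosen minimum.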

Before vs After Trimming

Before

150 bp reads
Quality drops at position 120+
15% contain adapter
Mean Q = 29.3

After

Mean length 128 bp
Quality uniform to read end
<0.1% adapter
Mean Q = 34.1
2% reads discarded (too short)

Rule of thumb: if >10% of reads are discarded, investigate the cause.
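Before/after numbers like these can be computed directly from the reads. The sketch below takes a list of (sequence, quality) pairs and reports read count, mean length, and mean Phred score, then applies the 10% rule of thumb. The reads are invented, and averaging Phred scores arithmetically is a common convention assumed here, not necessarily what any particular tool reports.

```python
def summarize(reads, offset=33):
    """Summarize a list of (sequence, quality) pairs."""
    n_reads = len(reads)
    total_bases = sum(len(seq) for seq, _ in reads)
    total_q = sum(ord(ch) - offset for _, qual in reads for ch in qual)
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "mean_q": total_q / total_bases if total_bases else 0.0,
    }

def discarded_fraction(n_before, n_after):
    """Fraction of reads lost between two pipeline stages."""
    return (n_before - n_after) / n_before

# Hypothetical toy data: one read trimmed to half length, one untouched, one discarded.
before = [("ACGTACGT", "IIII####"), ("ACGTACGT", "IIIIIIII"), ("ACGT", "####")]
after = [("ACGT", "IIII"), ("ACGTACGT", "IIIIIIII")]

stats_before, stats_after = summarize(before), summarize(after)
lost = discarded_fraction(stats_before["reads"], stats_after["reads"])
print(stats_before["mean_q"], stats_after["mean_q"])
if lost > 0.10:
    print(f"{lost:.0%} of reads discarded: investigate the cause")
```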

Module 4 · Day 1 · 90 min

Read Alignment to a Reference

Thirty million short reads, one three-billion-letter genome. How do the puzzle pieces find their place?

Notebook · 04_alignment.ipynb · BWA-MEM2 · SAMtools · IGV · Longest, most hands-on module

The one idea

Seed with an exact match (BWT index), then extend with Smith-Waterman — fast where it can be, careful where it must be.

You'll walk out able to

Code Smith-Waterman & trace it
Align reads to a BAM, then index
Read flagstat & coverage depth

The Alignment Problem

30 million 150bp puzzle pieces → put them back into a 3 billion bp genome

    Reference: ...ACGTACGTGCATGCATGCATCGATCG...

    Read:                 GCATGCATGC

    Position: chr22:17,000,100

Challenges

Mismatches (SNPs), indels, repeat regions, splicing (RNA), structural variants

BWA-MEM2 approach

1. Seed (exact match using BWT index)
2. Extend (Smith-Waterman)
3. Assign mapping quality
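The seed-then-extend idea can be tried on the toy reference from this slide. The sketch below deliberately swaps in simpler components: a plain k-mer dictionary stands in for the BWT-based index, and an ungapped mismatch count stands in for Smith-Waterman extension. All names and parameters (`k=5`, `max_mismatches`) are illustrative assumptions, not BWA internals.

```python
from collections import defaultdict

def build_kmer_index(ref, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def seed_and_extend(read, ref, index, k, max_mismatches=2):
    """Seed with exact k-mer hits, then check each candidate start without gaps."""
    hits = {}
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # implied start of the whole read on the reference
            if start < 0 or start + len(read) > len(ref) or start in hits:
                continue
            window = ref[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    # Best candidates first: fewest mismatches, then leftmost position.
    return sorted(hits.items(), key=lambda item: (item[1], item[0]))

ref = "ACGTACGTGCATGCATGCATCGATCG"
read = "GCATGCATGC"
index = build_kmer_index(ref, k=5)

exact = seed_and_extend(read, ref, index, k=5, max_mismatches=0)
relaxed = seed_and_extend(read, ref, index, k=5, max_mismatches=2)
print(exact)    # [(8, 0)]
print(relaxed)  # [(8, 0), (4, 2), (12, 2)]
```

Allowing two mismatches surfaces extra candidate positions in the repetitive stretch; competing placements of this kind are the sort of ambiguity a mapping quality score is meant to express.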

Smith-Waterman: Local Alignment

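import numpy as np

def smith_waterman(query, ref, match=2, mismatch=-1, gap=-2):
    """Fill the local-alignment scoring matrix; return it with the best score and its cell."""
    m, n = len(query), len(ref)
    H    = np.zeros((m+1, n+1), dtype=int)   # scoring matrix; row 0 and column 0 stay 0

    for i in range(1, m+1):
        for j in range(1, n+1):
            s = match if query[i-1] == ref[j-1] else mismatch
            H[i,j] = max(
                0,                     # local: allow restart
                H[i-1, j-1] + s,      # diagonal (match/mismatch)
                H[i-1, j]   + gap,    # query base against a gap: insertion relative to ref
                H[i,   j-1] + gap,    # ref base against a gap: deletion relative to ref
            )

    best_pos = np.unravel_index(H.argmax(), H.shape)   # (i, j) cell where the best alignment ends
    return H, H[best_pos], best_pos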

Smith-Waterman: Example

    Query:  ACGTACGT

    Ref:    TTTACGTACGTGGG

    Alignment:

    Query:  ACGTACGT

            ||||||||

    Ref:    ACGTACGT (starts at 0-based offset 3 of ref)

    Score: 16 (8 × match=2)

The DP matrix is visualized as a heatmap in the notebook — hot spots = good alignments.
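To recover the aligned strings and not just the score, the matrix is walked backwards from the best cell until a zero is reached. The sketch below is one possible traceback, written with plain lists so it runs without NumPy; it uses the same linear gap scoring as the slide, prefers the diagonal move on ties, and `smith_waterman_traceback` is a name chosen for illustration.

```python
def smith_waterman_traceback(query, ref, match=2, mismatch=-1, gap=-2):
    """Return (score, 0-based start in ref, aligned query, aligned ref)."""
    m, n = len(query), len(ref)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:  # strict '>' keeps the first best cell found
                best, best_pos = H[i][j], (i, j)

    # Walk back from the best cell until the score drops to 0.
    i, j = best_pos
    aligned_query, aligned_ref = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == ref[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:   # diagonal: match or mismatch
            aligned_query.append(query[i - 1])
            aligned_ref.append(ref[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:   # query base against a gap in ref
            aligned_query.append(query[i - 1])
            aligned_ref.append("-")
            i -= 1
        else:                                # ref base against a gap in query
            aligned_query.append("-")
            aligned_ref.append(ref[j - 1])
            j -= 1
    return best, j, "".join(reversed(aligned_query)), "".join(reversed(aligned_ref))

result = smith_waterman_traceback("ACGTACGT", "TTTACGTACGTGGG")
print(result)  # (16, 3, 'ACGTACGT', 'ACGTACGT')
```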

From Alignment to BAM

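# Index the reference genome (one-time)
# (with BWA-MEM2 installed, the equivalent subcommands are bwa-mem2 index and bwa-mem2 mem)
bwa index reference/chr22.fa

# Align reads (bwa mem writes SAM to stdout) and sort into a coordinate-sorted BAM
bwa mem -t 8 reference/chr22.fa sample.fastq.gz \
  | samtools sort -o results/sample.bam -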

# Index the BAM
samtools index results/sample.bam

# Quick stats
samtools flagstat results/sample.bam

Example flagstat lines (illustrative numbers; the "properly paired" line is only meaningful for paired-end input):
29123456 + 0 mapped (97.24% : N/A)
28101234 + 0 properly paired (93.54% : N/A)

Coverage: The Key Output Metric

Depth	Application
5–10x	Population studies (cheap)
30x	Gold standard WGS
100x	Cancer (somatic variants)
500x+	cfDNA (liquid biopsy)

Coverage uniformity

% genome covered at ≥ 10x should be > 95%.
Repeat regions, centromeres, and GC-extreme regions are typically under-covered.
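Mean depth can be estimated before sequencing with the usual back-of-envelope formula, depth ≈ reads × read length / genome size, and breadth can be checked afterwards from per-base depths. The sketch below assumes every read maps and uses an invented depth list; the function names are illustrative.

```python
import math

def expected_depth(n_reads, read_length, genome_size):
    """Mean depth if every base of every read maps somewhere in the genome."""
    return n_reads * read_length / genome_size

def reads_for_depth(target_depth, read_length, genome_size):
    """Number of reads needed to reach a target mean depth."""
    return math.ceil(target_depth * genome_size / read_length)

def breadth_at(depths, min_depth):
    """Fraction of positions covered at or above min_depth."""
    if not depths:
        return 0.0
    return sum(d >= min_depth for d in depths) / len(depths)

# 30 million 150 bp reads spread over a 3 Gb genome
depth = expected_depth(30_000_000, 150, 3_000_000_000)
print(depth)  # 1.5

# Reads needed for 30x on the same genome
needed = reads_for_depth(30, 150, 3_000_000_000)
print(needed)  # 600000000

# Hypothetical per-base depths for eight positions
breadth = breadth_at([0, 5, 12, 30, 31, 9, 10, 40], min_depth=10)
print(breadth)  # 0.625
```

On real data the per-base depths could come from `samtools depth -a`, which prints one line per position with chromosome, position, and depth.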

Day 1 Summary

Module 1: CLI

pipe, grep, awk, cut
subprocess, pathlib

Module 2: Formats

FASTA, FASTQ (Phred)
SAM (CIGAR, FLAG)
VCF (FILTER, AF)

Modules 3-4: QC+Align

FastQC metrics
Trimmomatic SLIDINGWINDOW
Smith-Waterman, BWA

Tomorrow: BAM → variants → expression → visualization → capstone

See You Tomorrow

Day 2 starts at 9:00 AM

Tonight: review your notebook outputs.
Make sure your trimming pipeline produces reasonable before/after stats.
Check that your Smith-Waterman correctly traces back the alignment.

Questions? Office hours after the session or email the instructors.