Bioinformatics Workshop

Day 1: Raw Data to Aligned Reads

2 days · 4 hours each · Hands-on, tutorial-oriented

Linux CLI Sequence Formats Quality Control Alignment

Goals for Today

  • Navigate the Linux command line confidently
  • Understand what's inside FASTA, FASTQ, SAM, and VCF files
  • Assess read quality and apply trimming
  • Understand Smith-Waterman and how BWA uses it
Format: Each module = 10 min lecture + 30 min notebook exercises.
Run cells as we go — errors are expected and educational.

The Bioinformatics Pipeline

Patient / Sample
DNA/RNA extraction
Library preparation
Sequencing → FASTQ files ← YOU ARE HERE
Quality control + trimming  Module 3
Alignment to reference  Module 4
Variant calling / expression quantification  Day 2
Biological interpretation

Why This Matters

Scale

1000 Genomes: 2,500 genomes, 84 TB raw data.
UK Biobank: 500,000 participants.

Medicine

Every hospital will sequence patients routinely by 2030.
FDA-approved genomic tests: 75+ and rising.

Jobs

Bioinformatics: fastest-growing field in biology.
Median salary: $95K–$140K.

Module 1

Linux CLI for Bioinformatics

45 minutes · Notebook: 01_linux_cli.ipynb

Why the Terminal?

  • All major tools: bwa, samtools, gatk, trimmomatic — CLI only
  • Pipelines process hundreds of samples — no GUI can do that
  • Reproducibility: your script is your methods section
  • SSH into HPC clusters: only a terminal is available

Essential Commands

Navigation

pwd           # where am I?
ls -lh        # list with sizes
cd data/      # change directory
mkdir -p results/qc
find . -name "*.fastq"

Inspect files

head -n 8 sample.fastq
wc -l sample.fastq    # count lines
file sample.fastq.gz  # identify type

Pipes

# Count reads in FASTQ
grep -c '^@' sample.fastq

# Chromosome counts in BED
cut -f1 genes.bed | sort | uniq -c | sort -rn

# Process gzip without decompressing
zcat sample.fastq.gz | head -n 8

Shell Scripts = Reproducible Pipelines

#!/usr/bin/env bash
set -euo pipefail   # fail on errors, undefined vars, pipe failures

SAMPLES=(SRR001 SRR002 SRR003)
REF="reference/chr22.fa"
THREADS=8

mkdir -p results/alignment

for SAMPLE in "${SAMPLES[@]}"; do
    echo "[$(date)] Aligning ${SAMPLE}..."
    bwa mem -t ${THREADS} ${REF} \
        data/${SAMPLE}.fastq.gz \
        | samtools sort -o results/alignment/${SAMPLE}.bam
    samtools index results/alignment/${SAMPLE}.bam
    echo "[$(date)] Done: ${SAMPLE}"
done

This script is your methods section. Version control it in git.

Python + Shell: The Best of Both

import subprocess
from pathlib import Path

def run(cmd):
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout

# Count reads in a FASTQ file
n_reads = int(run("grep -c '^@' sample.fastq").strip())
print(f"Reads: {n_reads}")

# Build output paths safely
def output_bam(fastq: Path, outdir: Path) -> Path:
    stem = fastq.name.replace('.fastq.gz', '').replace('.fastq', '')
    return outdir / f"{stem}.bam"

print(output_bam(Path("SRR001.fastq.gz"), Path("results")))

Module 2

Sequence Data Formats

45 minutes · Notebook: 02_sequence_formats.ipynb

Format Landscape

FormatContainsTypical sizeTool
FASTASequence onlyReference: 3 GBBLASTn, BWA index
FASTQSequence + quality1–20 GB/sampleTrimmomatic, BWA
SAM/BAMAligned reads5–30 GBSAMtools, GATK
VCFVariants100 MB–10 GBGATK, bcftools

FASTQ Format

@read_001 chr22:sim length=150     ← header (@ prefix)
ACGTACGTACGTACGTACGTACGTACGT...   ← sequence (150 bases)
+                                  ← separator
IIIIHHGG::IIIIIIIIIIIIIIIIII...   ← quality (ASCII Phred+33)

Phred Quality Score

Q = −10 log₁₀(Perror)

ScoreError rateAccuracyASCII char
Q101 in 1090%'+' (43)
Q201 in 10099%'5' (53)
Q301 in 1,00099.9%'?' (63)
Q401 in 10,00099.99%'I' (73)

SAM Format

# Header
@HD VN:1.6  SO:coordinate
@SQ SN:chr22 LN:50818468

# Alignment record
QNAME   FLAG RNAME  POS     MAPQ CIGAR    RNEXT PNEXT TLEN SEQ  QUAL  TAGS
read001  0   chr22  17000001  60  75M2I73M  *     0     0   ACG  III   NM:i:2

CIGAR string: 75M2I73M

75 bp match · 2 bp insertion in query · 73 bp match
M=match, I=insertion, D=deletion, S=soft clip, N=intron skip

VCF Format

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">
#CHROM  POS       ID       REF ALT QUAL  FILTER INFO              FORMAT  SAMPLE1
chr22   17000100  rs12345  A   G   856.3 PASS   DP=45;AF=0.52     GT:DP   0/1:45

GT field

0/1 = heterozygous (one ref, one alt)
1/1 = homozygous alt
0/0 = homozygous ref

Key INFO fields

DP = total depth
AF = allele frequency
MQ = mapping quality

Module 3

Quality Control + Read Trimming

60 minutes · Notebook: 03_quality_control.ipynb

Why QC?

What goes wrong

  • Quality degrades at 3′ end of reads
  • Adapter contamination (library prep artifact)
  • GC bias (PCR amplification skew)
  • Optical duplicates
  • Low-complexity sequences (polyA, polyG)

Consequences if ignored

  • Misalignments → false variant calls
  • Adapter reads → alignment failures
  • Low-quality bases → noise in pileup
  • Biased counts → wrong DE results

FastQC Dashboard

Per-base quality

Quality vs position in read.
Green: Q>28, Yellow: Q20-28, Red: Q<20

Adapter content

% reads containing adapter at each position.
>5% = definite contamination

GC distribution

Should match reference GC%.
Human genome: ~41% GC
# Run FastQC (real tool)
fastqc sample.fastq.gz -o results/qc/

# Or multiqc to aggregate many samples
multiqc results/qc/ -o results/multiqc/

Trimmomatic: The Gold Standard Trimmer

trimmomatic PE \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1_paired.fastq.gz sample_R1_unpaired.fastq.gz \
  sample_R2_paired.fastq.gz sample_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \  # adapter sequences
  LEADING:3 \                             # trim 5' if Q < 3
  TRAILING:3 \                            # trim 3' if Q < 3
  SLIDINGWINDOW:4:20 \                    # 4-base window, min mean Q=20
  MINLEN:36                               # discard reads < 36 bp
SLIDINGWINDOW:4:20 — scan from 5′ end; trim from the first position where the mean quality in a 4-base window drops below 20.

Before vs After Trimming

Before

  • 150 bp reads
  • Quality drops at position 120+
  • 15% contain adapter
  • Mean Q = 29.3

After

  • Mean length 128 bp
  • Quality uniform to read end
  • <0.1% adapter
  • Mean Q = 34.1
  • 2% reads discarded (too short)

Rule of thumb: if >10% of reads are discarded, investigate the cause.

Module 4

Read Alignment

90 minutes · Notebook: 04_alignment.ipynb

The Alignment Problem

30 million 150bp puzzle pieces → put them back into a 3 billion bp genome

Reference: ...ACGTACGTGCATGCATGCATCGATCG...
Read: GCATGCATGC
Position: chr22:17,000,100

Challenges

Mismatches (SNPs), indels, repeat regions, splicing (RNA), structural variants

BWA-MEM2 approach

1. Seed (exact match using BWT index)
2. Extend (Smith-Waterman)
3. Assign mapping quality

Smith-Waterman: Local Alignment

def smith_waterman(query, ref, match=2, mismatch=-1, gap=-2):
    m, n = len(query), len(ref)
    H    = np.zeros((m+1, n+1))   # scoring matrix

    for i in range(1, m+1):
        for j in range(1, n+1):
            s = match if query[i-1] == ref[j-1] else mismatch
            H[i,j] = max(
                0,                     # local: allow restart
                H[i-1, j-1] + s,      # diagonal (match/mismatch)
                H[i-1, j]   + gap,    # deletion in query
                H[i,   j-1] + gap,    # insertion in query
            )

    best_pos = np.unravel_index(H.argmax(), H.shape)
    return H, H[best_pos], best_pos

Smith-Waterman: Example

Query:  ACGTACGT
Ref:    TTTACGTACGTGGG

Alignment:
Query:  ACGTACGT
        ||||||||
Ref:    ACGTACGT (at position 3 of ref)
Score: 16 (8 × match=2)

The DP matrix is visualized as a heatmap in the notebook — hot spots = good alignments.

From Alignment to BAM

# Index the reference genome (one-time)
bwa index reference/chr22.fa

# Align reads (produces SAM)
bwa mem -t 8 reference/chr22.fa sample.fastq.gz \
  | samtools sort -o results/sample.bam

# Index the BAM
samtools index results/sample.bam

# Quick stats
samtools flagstat results/sample.bam
Expected flagstat output:
29,123,456 + 0 mapped (97.24% : N/A)
28,101,234 + 0 properly paired (93.54% : N/A)

Coverage: The Key Output Metric

DepthApplication
5–10xPopulation studies (cheap)
30xGold standard WGS
100xCancer (somatic variants)
500x+cfDNA (liquid biopsy)

Coverage uniformity

% genome covered at ≥ 10x should be > 95%.
Repeat regions, centromeres, and GC-extreme regions are always under-covered.

Day 1 Summary

Module 1: CLI

pipe, grep, awk, cut
subprocess, pathlib

Module 2: Formats

FASTA, FASTQ (Phred)
SAM (CIGAR, FLAG)
VCF (FILTER, AF)

Modules 3-4: QC+Align

FastQC metrics
Trimmomatic SLIDINGWINDOW
Smith-Waterman, BWA
Tomorrow: BAM → variants → expression → visualization → capstone

See You Tomorrow

Day 2 starts at 9:00 AM

Tonight: review your notebook outputs.
Make sure your trimming pipeline produces reasonable before/after stats.
Check that your Smith-Waterman correctly traces back the alignment.

Questions? Office hours after the session or email the instructors.