{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 4: Read Alignment\n",
    "\n",
    "**Duration:** 90 minutes\n",
    "**Day:** 1 of 2\n",
    "\n",
    "### Where we left off\n",
    "\n",
    "Module 3 handed us **clean, trimmed FASTQ reads**. But a read is just a short string - it\n",
    "has no idea *where on the genome* it came from. **Alignment** places every read back onto\n",
    "a reference so we can later ask where the sample differs.\n",
    "\n",
    "### How this module works\n",
    "\n",
    "In a real pipeline you align with **two commands** and never write an aligner:\n",
    "\n",
    "```bash\n",
    "bwa-mem2 index reference.fa                                   # build the index once\n",
    "bwa-mem2 mem reference.fa reads.fq | samtools sort -o aln.bam # align + sort\n",
    "samtools index aln.bam                                        # make regions seekable\n",
    "```\n",
    "\n",
    "`bwa-mem2` uses **seed-and-extend**: its index finds short exact matches fast, then\n",
    "**Smith-Waterman** extends each seed over mismatches and indels. So we build *one* thing by\n",
    "hand - Smith-Waterman - because it's the heart of every aligner; everything else is **run\n",
    "the tool, read the output**. The DP code lives in **`align_toolkit.py`** (download it).\n",
    "\n",
    "## Learning Objectives\n",
    "\n",
    "- Trace the Smith-Waterman matrix and read off the optimal local alignment\n",
    "- Turn an alignment into a **CIGAR string** - how SAM/BAM records a match\n",
    "- Run `bwa-mem2` + `samtools` and interpret `flagstat` and coverage\n",
    "- Call variants from a pileup with `samtools mpileup | bcftools call`\n",
    "\n",
    "> **On the page:** the two explorers below - the live Smith-Waterman matrix and the pileup\n",
    "> variant-caller - let you turn the same knobs (scoring scheme, AF / depth thresholds) and\n",
    "> watch the result change. Turn them as you read."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": "%matplotlib inline\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport re, os, random\nfrom pathlib import Path\nnp.random.seed(42)\nrandom.seed(42)   # simulate_pileup() uses the stdlib random module -- seed it too\nprint(\"Imports loaded.\")"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# align_toolkit.py holds the Smith-Waterman engine, pileup simulator + plot helpers.\n",
    "from align_toolkit import (\n",
    "    smith_waterman, traceback, traceback_path, align_and_show, cigar_for,\n",
    "    simulate_pileup, plot_sw_matrix, plot_pileup, plot_coverage, example_depths,\n",
    ")\n",
    "print(\"Alignment toolkit loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 1: Smith-Waterman - the one algorithm we build by hand\n",
    "\n",
    "Smith-Waterman finds the **best local alignment**: it can place a short read inside a long\n",
    "reference by ignoring leading/trailing mismatches (the `max(0, ...)` lets the alignment\n",
    "restart anywhere).\n",
    "\n",
    "**Scoring:** match +2, mismatch -1, gap -2.\n",
    "\n",
    "```text\n",
    "H[i,j] = max(\n",
    "    0,                         # local: allow restart\n",
    "    H[i-1,j-1] + score(a,b),   # diagonal (match / mismatch)\n",
    "    H[i-1,j]   - gap,          # delete in query\n",
    "    H[i,j-1]   - gap,          # insert in query\n",
    ")\n",
    "```\n",
    "\n",
    "Turn the scoring knobs live and watch the matrix and traceback recompute:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:sw-->"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Align a short query inside a longer reference, then read off the alignment.\n",
    "query = \"ACGTACGT\"\n",
    "ref   = \"TTTACGTACGTGGG\"\n",
    "H, T, best_score, best_pos = smith_waterman(query, ref)\n",
    "aq, aln, ar = traceback(query, ref, H, T, best_pos)\n",
    "print(f\"Best score: {best_score:.0f} at {best_pos}\\n\")\n",
    "print(f\"Query:     {aq}\")\n",
    "print(f\"           {aln}\")\n",
    "print(f\"Ref:       {ar}\")"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# The DP matrix as a heatmap, best cell starred (one toolkit call).\n",
    "plot_sw_matrix(H, query, ref, best_pos=best_pos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 2: Mismatches, indels, and the CIGAR string\n",
    "\n",
    "Real reads carry SNPs and small indels. Watch SW handle each, then see how SAM/BAM records\n",
    "the result - not as a two-line alignment but as a compact **CIGAR string**:\n",
    "\n",
    "- **M** aligned column (match *or* mismatch) - **I** insertion - **D** deletion -\n",
    "  **S** soft-clip (read ends outside the local alignment)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "align_and_show(\"ACGTACGT\",   \"NNNNACGTACGTNN\", \"Perfect match\")\n",
    "align_and_show(\"ACGTAGGT\",   \"NNNNACGTACGTNN\", \"1 mismatch (SNP)\")\n",
    "align_and_show(\"ACGTTTACGT\", \"NNNNACGTACGTNN\", \"2-base insertion\")\n",
    "align_and_show(\"ACGCGT\",     \"NNNNACGTACGTNN\", \"2-base deletion\")"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# cigar_for() aligns a read and returns its CIGAR -- column 6 of every SAM record.\n",
    "for label, read in [(\"perfect\",\"ACGTACGT\"), (\"1 SNP\",\"ACGTAGGT\"),\n",
    "                    (\"insertion\",\"ACGTTTACGT\"), (\"deletion\",\"ACGCGT\")]:\n",
    "    cig, sc = cigar_for(read, \"NNNNACGTACGTNN\")\n",
    "    print(f\"{label:10s} read={read:12s} CIGAR={cig:10s} score={sc:.0f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **gap penalty** is the sensitivity-vs-specificity knob: cheap gaps favour indels,\n",
    "expensive gaps favour mismatches. Feel it in the explorer above, or here:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "q, r = \"ACGTTTACGT\", \"NNNNACGTACGTNN\"   # query has a 2-base insert vs ref\n",
    "for gap_pen in [-1, -2, -4, -8]:\n",
    "    H, T, score, pos = smith_waterman(q, r, gap=gap_pen)\n",
    "    aq, aln, ar = traceback(q, r, H, T, pos)\n",
    "    print(f\"gap={gap_pen:3d}  score={score:4.0f}   {aq}  /  {ar}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 3: Align with bwa-mem2, summarise with samtools\n",
    "\n",
    "That's the algorithm. At genome scale you run the real tools. After aligning, the first\n",
    "thing you check is **`samtools flagstat`** - a one-glance health report of the BAM:\n",
    "\n",
    "```bash\n",
    "bwa-mem2 mem reference.fa reads.fq | samtools sort -o aln.bam\n",
    "samtools index aln.bam\n",
    "samtools flagstat aln.bam\n",
    "```\n",
    "\n",
    "A representative `flagstat` for a 30M-read whole-genome run:\n",
    "\n",
    "```text\n",
    "30000000 + 0 in total (QC-passed reads + QC-failed reads)\n",
    "0 + 0 secondary\n",
    "12500 + 0 supplementary\n",
    "3600000 + 0 duplicates\n",
    "29160000 + 0 mapped (97.20% : N/A)\n",
    "30000000 + 0 paired in sequencing\n",
    "28590000 + 0 properly paired (95.30% : N/A)\n",
    "```\n",
    "\n",
    "You don't eyeball these on a real run - you parse the three rates that matter. That parsing\n",
    "is the Python worth keeping:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Parse the three health metrics out of samtools flagstat text.\n",
    "flagstat = \"\"\"30000000 + 0 in total (QC-passed reads + QC-failed reads)\n",
    "0 + 0 secondary\n",
    "12500 + 0 supplementary\n",
    "3600000 + 0 duplicates\n",
    "29160000 + 0 mapped (97.20% : N/A)\n",
    "30000000 + 0 paired in sequencing\n",
    "28590000 + 0 properly paired (95.30% : N/A)\"\"\"\n",
    "\n",
    "def field(pattern):\n",
    "    return int(re.search(pattern, flagstat).group(1))\n",
    "\n",
    "total  = field(r\"(\\d+) \\+ \\d+ in total\")\n",
    "mapped = field(r\"(\\d+) \\+ \\d+ mapped\")\n",
    "dups   = field(r\"(\\d+) \\+ \\d+ duplicates\")\n",
    "proper = field(r\"(\\d+) \\+ \\d+ properly paired\")\n",
    "\n",
    "map_rate, dup_rate, pp_rate = mapped/total*100, dups/mapped*100, proper/total*100\n",
    "print(f\"Mapping rate:     {map_rate:5.1f}%   {'OK' if map_rate > 90 else 'LOW - wrong reference?'}\")\n",
    "print(f\"Duplication rate: {dup_rate:5.1f}%   {'OK' if dup_rate < 20 else 'HIGH - PCR over-amp?'}\")\n",
    "print(f\"Proper-pair rate: {pp_rate:5.1f}%   {'OK' if pp_rate > 90 else 'LOW - check library'}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 4: Coverage depth\n",
    "\n",
    "`samtools depth` (or `samtools coverage`) reports how many reads cover each position - the\n",
    "other half of alignment QC. You want most of the genome at >=10x for confident calls.\n",
    "\n",
    "```bash\n",
    "samtools depth -a aln.bam > depth.txt   # one position per line: chrom  pos  depth\n",
    "```\n",
    "\n",
    "A real `depth.txt` has millions of rows; `example_depths()` stands in for a parsed\n",
    "whole-genome profile so we can plot the two views you always look at:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "depths = example_depths()       # stand-in for a parsed `samtools depth` profile\n",
    "plot_coverage(depths)\n",
    "\n",
    "arr = np.array(depths)\n",
    "print(f\"Mean coverage:    {arr.mean():.1f}x\")\n",
    "print(f\"Covered >=10x:    {(arr >= 10).mean()*100:.1f}%   (germline calling)\")\n",
    "print(f\"Covered >=30x:    {(arr >= 30).mean()*100:.1f}%   (gold standard)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 5: From pileup to variant calls\n",
    "\n",
    "A **pileup** stacks the aligned reads at each reference position and counts the bases.\n",
    "Variant calling is, at its core, *counting*: where do reads disagree with the reference,\n",
    "often enough and deep enough to be real?\n",
    "\n",
    "```bash\n",
    "samtools mpileup -f reference.fa aln.bam | \\\n",
    "  bcftools call -mv -Oz -o calls.vcf.gz      # -m multiallelic caller, -v variants only\n",
    "```\n",
    "\n",
    "Explore the two thresholds every caller exposes - minimum allele frequency and minimum\n",
    "depth - and watch true SNPs vs sequencing noise get called:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:pileup-->"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": "# Simulate a pileup with three planted SNPs at different allele frequencies.\nsnps = {25: {'alt': 'T', 'af': 0.50},   # heterozygous\n        50: {'alt': 'G', 'af': 0.95},   # near-homozygous\n        75: {'alt': 'A', 'af': 0.08}}   # very-low-frequency / subclonal\nref_seq, pileup, sim_reads = simulate_pileup(\n    ref_len=100, n_reads=100, read_len=30, snp_positions=snps, error_rate=0.005)\n\nprint(f\"{len(sim_reads)} reads over a {len(ref_seq)} bp reference.\\n\")\nfor pos, info in snps.items():\n    p = pileup[pos]; af = p[info['alt']] / max(1, p['depth'])\n    print(f\"pos {pos:3d}: ref={ref_seq[pos]} alt={info['alt']}  depth={p['depth']:3d}  \"\n          f\"observed AF={af:.2f}  (planted {info['af']:.2f})\")"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# The three-panel pileup view (depth, allele frequency, base composition).\n",
    "plot_pileup(pileup, ref_seq, snps=snps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`bcftools` would emit those calls as a **VCF** - the format Module 5 picks up:\n",
    "\n",
    "```text\n",
    "#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO\n",
    "chr1     25   .   A    T    221   PASS    DP=50;AF=0.50    # het, called confidently\n",
    "chr1     50   .   G    C    255   PASS    DP=48;AF=0.95    # near-homozygous\n",
    "chr1     75   .   T    C     38   LowAF   DP=46;AF=0.10    # borderline subclonal\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Exercises\n",
    "\n",
    "### Exercise 1: Align reads and score percent identity"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "reference = \"ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG\"\n",
    "for name, read in [(\"perfect\", \"ATCGATCGATCG\"), (\"one_snp\", \"ATCGATCTATCG\"),\n",
    "                   (\"deletion\", \"ATCGATCATCG\"), (\"low_sim\", \"GGGGGGGGGGG\")]:\n",
    "    H, T, score, pos = smith_waterman(read, reference)\n",
    "    aq, aln, ar = traceback(read, reference, H, T, pos)\n",
    "    pid = aln.count('|') / len(aln) * 100 if aln else 0.0\n",
    "    print(f\"{name:9s} score={score:3.0f}  identity={pid:5.1f}%   {aq}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Visualise the DP matrix with its traceback path\n",
    "\n",
    "Use the toolkit's `traceback_path()` to collect the path cells and overlay them."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "ex_q, ex_r = \"GCATGCG\", \"GCATGGGCATGCG\"\n",
    "H, T, best_score, best_pos = smith_waterman(ex_q, ex_r)\n",
    "path = traceback_path(H, T, best_pos)\n",
    "print(f\"Best score {best_score:.0f} at {best_pos}; traceback visits {len(path)} cells.\")\n",
    "plot_sw_matrix(H, ex_q, ex_r, best_pos=best_pos, path=path,\n",
    "               title=\"SW matrix with traceback path\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: Call variants from the pileup\n",
    "\n",
    "Variant calling is counting. Implement the rule - call a position if the non-reference\n",
    "allele frequency is high enough *and* the depth is sufficient - and check it against the\n",
    "SNPs we planted."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": "def simple_variant_call(pileup, reference, min_af=0.2, min_depth=10):\n    calls = []\n    for pos in sorted(pileup):\n        c = pileup[pos]\n        if c['depth'] < min_depth:\n            continue\n        ref_base = reference[pos]\n        alt = max('ACGT', key=lambda b: c[b] if b != ref_base else -1)\n        af = c[alt] / c['depth']\n        if af >= min_af:\n            calls.append((pos, ref_base, alt, af, c['depth']))\n    return calls\n\ncalled = simple_variant_call(pileup, ref_seq)        # default min_af = 0.20\nfor pos, rb, alt, af, dp in called:\n    print(f\"  pos={pos:3d}  {rb}>{alt}  AF={af:.2f}  depth={dp}\")\n\ndetected = {pos for pos, *_ in called} & set(snps)\nprint(f\"\\nDetected {len(detected)}/{len(snps)} planted SNPs at min_af=0.20: {sorted(detected)}\")\nprint(\"The subclonal SNP at pos 75 (observed AF ~0.10) sits below the 0.20 default, so it is\")\nprint(\"missed here. Loosen min_af (e.g. to 0.10) in the pileup explorer above and it appears --\")\nprint(\"along with false positives from sequencing noise. That trade-off is variant calling.\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "| Step | Real command | What you read |\n",
    "|------|--------------|---------------|\n",
    "| Index + align | `bwa-mem2 index` ; `bwa-mem2 mem ref.fa reads.fq \\| samtools sort` | sorted BAM |\n",
    "| Alignment QC | `samtools flagstat aln.bam` | mapping / dup / proper-pair rates |\n",
    "| Coverage | `samtools depth -a aln.bam` | mean depth, % >=10x / >=30x |\n",
    "| Call variants | `samtools mpileup \\| bcftools call -mv` | a VCF of variants |\n",
    "\n",
    "**What ran where:** Smith-Waterman is the one algorithm we built by hand - it's the extend\n",
    "step inside `bwa-mem2`, and the explorer let you feel its scoring knobs. Everything else is\n",
    "the real `samtools`/`bcftools` workflow; the Python parsed `flagstat` and counted pileup\n",
    "bases - the two things you genuinely do by hand.\n",
    "\n",
    "**Key takeaways:**\n",
    "- `bwa-mem2` seeds with its index, then extends with Smith-Waterman - no longer a black box.\n",
    "- Gap penalty is the sensitivity-vs-specificity knob.\n",
    "- Always check alignment QC: mapping rate > 90%, duplication < 20%, even coverage.\n",
    "- Variant calling is counting pileup bases against AF + depth thresholds.\n",
    "\n",
    "> **In the live session** we open the sorted BAM in **IGV** (a desktop genome browser) to\n",
    "> eyeball coverage and the planted SNPs - every number it shows is one we computed above.\n",
    "\n",
    "**Next (Day 2):** Module 5 - BAM processing and variant calling, where these counts become\n",
    "filtered, annotated VCF records."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}