{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 5: BAM \u2192 Variants \u2014 Calling and Filtering\n",
    "\n",
    "**Duration:** 70 minutes\n",
    "**Day:** 2 of 2\n",
    "\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of this module you will be able to:\n",
    "- **Call variants** from a sorted BAM with a real caller (bcftools / GATK)\n",
    "- Parse and navigate the resulting VCF programmatically\n",
    "- Apply multi-criteria variant filters (QUAL, DP, AF) and compare to the caller's own FILTER\n",
    "- Read the QC plots: allele frequency, depth, and the Ti/Tv ratio\n",
    "- Interpret genotypes (het / hom-alt) from the VCF\n",
    "\n",
    "---\n",
    "\n",
    "> **The tools are bcftools and GATK HaplotypeCaller.** Module 4 left us a sorted,\n",
    "> indexed BAM. Calling turns that BAM into a VCF. GATK HaplotypeCaller is the\n",
    "> germline gold standard (it reassembles the local haplotype before calling);\n",
    "> `bcftools call` is the fast, ubiquitous alternative. We show the real command,\n",
    "> then spend the module *reading and filtering* the VCF it produces \u2014 the skill you\n",
    "> actually use day to day."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plotting + parsing libraries, plus the variant plotting toolkit.\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import math\n",
    "import re\n",
    "from pathlib import Path\n",
    "from collections import Counter\n",
    "import os\n",
    "\n",
    "from variant_toolkit import plot_qc_dashboard, plot_mutation_spectrum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Point at the capstone data folder and show where we're working from.\n",
    "workshop_root = Path(os.path.abspath(\"../../\"))\n",
    "capstone_dir  = workshop_root / 'data' / 'capstone'\n",
    "\n",
    "print(f\"Workshop root: {workshop_root}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 1: From BAM to VCF \u2014 calling variants\n",
    "\n",
    "Module 4 produced `sample.sorted.bam` (reads aligned to chr22 and sorted). A\n",
    "**variant caller** walks that BAM and, at each position, decides whether the\n",
    "pileup of bases is evidence for a variant. Here is the real command pair \u2014 run it\n",
    "on your own machine where the tools are installed:\n",
    "\n",
    "```bash\n",
    "# bcftools: pileup the BAM against the reference, then call variants\n",
    "bcftools mpileup -f chr22.fa sample.sorted.bam \\\n",
    "  | bcftools call -mv -Oz -o sample.vcf.gz        # -m multiallelic caller, -v variants only\n",
    "\n",
    "# GATK gold-standard alternative (germline):\n",
    "gatk HaplotypeCaller -R chr22.fa -I sample.sorted.bam -O sample.vcf.gz\n",
    "```\n",
    "\n",
    "Neither tool runs in the browser kernel, so the workshop ships the VCF this step\n",
    "produces (`synthetic_variants.vcf`, 50 candidate variants on chr22:17.0\u201317.5 Mb).\n",
    "The cell below shows the command and the caller's console summary; the rest of the\n",
    "module loads and interprets that VCF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# What you would run, and the summary the caller prints.\n",
    "print(\"$ bcftools mpileup -f chr22.fa sample.sorted.bam | bcftools call -mv -Oz -o sample.vcf.gz\")\n",
    "print()\n",
    "print(\"[mpileup] 1 sample in 1 input file\")\n",
    "print(\"Lines   total/split/realigned/skipped:  462000/0/0/0\")\n",
    "print(\"Wrote 50 variant records to sample.vcf.gz  (chr22:17,003,407-17,467,759)\")\n",
    "print()\n",
    "print(\"-> next: load this VCF and read it record by record.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 2: VCF Parsing (Deep Dive)\n",
    "\n",
    "VCF (Variant Call Format) v4.2 is the standard for variant storage.\n",
    "Every bioinformatics pipeline ends with a VCF file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the whole VCF file into memory as a list of text lines.\n",
    "vcf_path = capstone_dir / 'synthetic_variants.vcf'\n",
    "with open(vcf_path) as f:\n",
    "    lines = f.readlines()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The header: '##' lines are metadata, the '#CHROM' line names the columns.\n",
    "print(\"=== VCF header (meta lines) ===\")\n",
    "for line in lines:\n",
    "    if line.startswith('##'):\n",
    "        print(line.rstrip())\n",
    "    elif line.startswith('#CHROM'):\n",
    "        print(line.rstrip())\n",
    "        break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Everything that isn't a '#' line is an actual variant record.\n",
    "data_lines = [l for l in lines if not l.startswith('#') and l.strip()]\n",
    "print(f\"=== First 5 of {len(data_lines)} variant records ===\")\n",
    "for line in data_lines[:5]:\n",
    "    print(line.rstrip())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Building a VCF parser\n",
    "\n",
    "We'll wrap each variant in a small `VCFRecord` class so we can ask friendly questions\n",
    "like \"is this a transition?\" instead of indexing raw columns by hand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class VCFRecord:\n",
    "    \"\"\"\n",
    "    Represents a single VCF variant record with convenient accessors.\n",
    "    \"\"\"\n",
    "    __slots__ = ['chrom', 'pos', 'id', 'ref', 'alt', 'qual', 'filter',\n",
    "                 'info', 'format', 'samples', '_raw']\n",
    "\n",
    "    def __init__(self, line, header_cols):\n",
    "        self._raw  = line\n",
    "        parts      = line.strip().split('\\t')\n",
    "        self.chrom = parts[0]\n",
    "        self.pos   = int(parts[1])\n",
    "        self.id    = parts[2]\n",
    "        self.ref   = parts[3]\n",
    "        self.alt   = parts[4].split(',')   # ALT can be multi-allelic\n",
    "        self.qual  = float(parts[5]) if parts[5] != '.' else None\n",
    "        self.filter = parts[6].split(';')\n",
    "\n",
    "        # Parse INFO\n",
    "        self.info = {}\n",
    "        for item in parts[7].split(';'):\n",
    "            if '=' in item:\n",
    "                k, v = item.split('=', 1)\n",
    "                self.info[k] = v\n",
    "            elif item:\n",
    "                self.info[item] = True\n",
    "\n",
    "        # Parse FORMAT and samples\n",
    "        self.format  = parts[8].split(':') if len(parts) > 8 else []\n",
    "        self.samples = {}\n",
    "        for sname, sdata in zip(header_cols[9:], parts[9:]):\n",
    "            self.samples[sname] = dict(zip(self.format, sdata.split(':')))\n",
    "\n",
    "    @property\n",
    "    def dp(self):\n",
    "        return int(self.info.get('DP', 0))\n",
    "\n",
    "    @property\n",
    "    def af(self):\n",
    "        return float(self.info.get('AF', 0))\n",
    "\n",
    "    @property\n",
    "    def is_pass(self):\n",
    "        return self.filter == ['PASS']\n",
    "\n",
    "    @property\n",
    "    def is_transition(self):\n",
    "        ti = {('A','G'),('G','A'),('C','T'),('T','C')}\n",
    "        return len(self.alt) == 1 and (self.ref, self.alt[0]) in ti\n",
    "\n",
    "    @property\n",
    "    def is_snv(self):\n",
    "        return len(self.ref) == 1 and all(len(a) == 1 for a in self.alt)\n",
    "\n",
    "    def __repr__(self):\n",
    "        return (f\"VCFRecord({self.chrom}:{self.pos} {self.ref}>{','.join(self.alt)} \"\n",
    "                f\"QUAL={self.qual:.0f} DP={self.dp} AF={self.af:.3f} \"\n",
    "                f\"FILTER={self.filter})\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A small helper that streams through the file and turns each line into a VCFRecord.\n",
    "def read_vcf(filepath):\n",
    "    \"\"\"Parse a VCF file. Returns (meta_lines, header_cols, records).\"\"\"\n",
    "    meta, header_cols, records = [], [], []\n",
    "    with open(filepath) as f:\n",
    "        for line in f:\n",
    "            if line.startswith('##'):\n",
    "                meta.append(line.rstrip())\n",
    "            elif line.startswith('#CHROM'):\n",
    "                header_cols = line.rstrip()[1:].split('\\t')  # strip '#'\n",
    "            elif line.strip():\n",
    "                records.append(VCFRecord(line, header_cols))\n",
    "    return meta, header_cols, records"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the file and peek at the first few records.\n",
    "meta, header_cols, variants = read_vcf(vcf_path)\n",
    "print(f\"Loaded {len(variants)} variants\")\n",
    "print(f\"Samples: {header_cols[9:]}\")\n",
    "print()\n",
    "for v in variants[:5]:\n",
    "    print(v)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 3: Variant Filtering\n",
    "\n",
    "Raw caller output contains both true variants and false positives.\n",
    "Filtering by QUAL, depth, and allele frequency removes most artifacts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the hard filter: a variant must clear minimum QUAL, depth, and an AF window.\n",
    "def apply_hard_filters(variants, min_qual=30, min_dp=15, min_af=0.05, max_af=0.99):\n",
    "    \"\"\"\n",
    "    Apply hard quality filters to a list of VCFRecord objects.\n",
    "    Returns (passing, failing) lists.\n",
    "    \"\"\"\n",
    "    passing, failing = [], []\n",
    "    for v in variants:\n",
    "        reasons = []\n",
    "        if v.qual is None or v.qual < min_qual:\n",
    "            reasons.append(f\"QUAL={v.qual:.0f}<{min_qual}\" if v.qual is not None else \"QUAL=.\")\n",
    "        if v.dp < min_dp:\n",
    "            reasons.append(f\"DP={v.dp}<{min_dp}\")\n",
    "        if not (min_af <= v.af <= max_af):\n",
    "            reasons.append(f\"AF={v.af:.3f} out of [{min_af},{max_af}]\")\n",
    "\n",
    "        if reasons:\n",
    "            failing.append((v, reasons))\n",
    "        else:\n",
    "            passing.append(v)\n",
    "    return passing, failing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run the filter and report how many variants survived (and why some failed).\n",
    "passing, failing = apply_hard_filters(variants)\n",
    "print(f\"Total variants:    {len(variants)}\")\n",
    "print(f\"Passing filters:   {len(passing)}\")\n",
    "print(f\"Failing filters:   {len(failing)}\")\n",
    "print()\n",
    "print(\"Failing variants and reasons:\")\n",
    "for v, reasons in failing[:10]:\n",
    "    print(f\"  {v.chrom}:{v.pos} {v.ref}>{v.alt[0]}  \u2014 {', '.join(reasons)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Filter comparison: also respect the FILTER column from GATK\n",
    "gatk_pass    = [v for v in variants if v.is_pass]\n",
    "our_pass     = passing\n",
    "both_pass    = [v for v in our_pass if v.is_pass]\n",
    "\n",
    "print(f\"GATK FILTER=PASS:   {len(gatk_pass)} ({len(gatk_pass)/len(variants)*100:.1f}%)\")\n",
    "print(f\"Our hard filters:   {len(our_pass)} ({len(our_pass)/len(variants)*100:.1f}%)\")\n",
    "print(f\"Both pass:          {len(both_pass)} ({len(both_pass)/len(variants)*100:.1f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 4: Allele Frequency Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Six-panel QC dashboard \u2014 the matplotlib lives in variant_toolkit.py.\n",
    "plot_qc_dashboard(variants, passing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 5: Ti/Tv Ratio\n",
    "\n",
    "The **transition/transversion (Ti/Tv) ratio** is a key quality metric:\n",
    "- Whole-genome sequencing: expected ~2.0\n",
    "- Whole-exome sequencing: expected ~2.5-3.0\n",
    "- Random mutations: expected ~0.5\n",
    "\n",
    "A Ti/Tv far from expected \u2192 likely systematic error or failed pipeline step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define Ti/Tv: count transitions vs transversions among single-allele SNVs.\n",
    "def compute_titv(variant_list):\n",
    "    \"\"\"Compute Ti/Tv ratio for a list of VCFRecord objects.\"\"\"\n",
    "    ti_pairs = {('A','G'),('G','A'),('C','T'),('T','C')}\n",
    "    ti = tv = 0\n",
    "    for v in variant_list:\n",
    "        if not v.is_snv or len(v.alt) != 1:\n",
    "            continue\n",
    "        if (v.ref, v.alt[0]) in ti_pairs:\n",
    "            ti += 1\n",
    "        else:\n",
    "            tv += 1\n",
    "    titv = ti / max(1, tv)\n",
    "    return ti, tv, titv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compare the Ti/Tv ratio before and after filtering.\n",
    "ti_all,  tv_all,  titv_all  = compute_titv(variants)\n",
    "ti_pass, tv_pass, titv_pass = compute_titv(passing)\n",
    "\n",
    "print(f\"{'':15s} {'Ti':>6} {'Tv':>6} {'Ti/Tv':>8}\")\n",
    "print(\"-\" * 40)\n",
    "print(f\"{'All variants':15s} {ti_all:>6d} {tv_all:>6d} {titv_all:>8.2f}\")\n",
    "print(f\"{'Passing':15s} {ti_pass:>6d} {tv_pass:>6d} {titv_pass:>8.2f}\")\n",
    "print()\n",
    "print(\"Expected Ti/Tv: ~2.0 for WGS, ~2.5-3.0 for WES\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Tally each substitution type (e.g. \"A>G\") among passing single-allele SNVs.\n",
    "mutation_types = Counter()\n",
    "for v in passing:\n",
    "    if v.is_snv and len(v.alt) == 1:\n",
    "        key = f\"{v.ref}>{v.alt[0]}\"\n",
    "        mutation_types[key] += 1\n",
    "\n",
    "# Split the substitution types into transitions vs transversions.\n",
    "ti_types = [k for k in mutation_types if (k[0], k[2]) in {('A','G'),('G','A'),('C','T'),('T','C')}]\n",
    "tv_types = [k for k in mutation_types if k not in ti_types]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mutation spectrum + Ti/Tv pie \u2014 one call into the toolkit.\n",
    "ti_total, tv_total = plot_mutation_spectrum(passing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 6: Genotype Interpretation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decode a genotype string like \"0/1\" into the actual alleles and a zygosity label.\n",
    "def interpret_genotype(gt, ref, alt):\n",
    "    \"\"\"Interpret a GT string (e.g. '0/1') in human terms.\"\"\"\n",
    "    gt_clean = gt.replace('|', '/')  # phased to unphased\n",
    "    alleles  = gt_clean.split('/')\n",
    "    allele_map = {'0': ref, '.': 'missing'}\n",
    "    for i, a in enumerate(alt.split(',')):\n",
    "        allele_map[str(i+1)] = a\n",
    "\n",
    "    decoded = [allele_map.get(a, '?') for a in alleles]\n",
    "\n",
    "    if '.' in alleles:\n",
    "        zygosity = 'missing'\n",
    "    elif alleles[0] == alleles[1]:\n",
    "        if alleles[0] == '0':\n",
    "            zygosity = 'homozygous reference'\n",
    "        else:\n",
    "            zygosity = 'homozygous alternate'\n",
    "    else:\n",
    "        zygosity = 'heterozygous'\n",
    "\n",
    "    return decoded, zygosity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply it to the first 10 passing variants to see human-readable genotypes.\n",
    "print(f\"{'GT':10s} | {'Decoded':20s} | {'Zygosity'}\")\n",
    "print(\"-\" * 60)\n",
    "for v in passing[:10]:\n",
    "    sample = list(v.samples.values())[0]\n",
    "    gt = sample.get('GT', './.')\n",
    "    decoded, zygosity = interpret_genotype(gt, v.ref, v.alt[0])\n",
    "    print(f\"{gt:10s} | {v.ref}>{v.alt[0]} \u2192 {'/'.join(decoded):15s} | {zygosity}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Exercises\n",
    "\n",
    "### Exercise 1: Custom filter thresholds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 1:\n",
    "# Apply progressively stricter filters and plot a table showing:\n",
    "#   For min_qual in [20, 50, 100, 200] AND min_dp in [10, 20, 30]:\n",
    "#     - Number of variants passing\n",
    "#     - Ti/Tv ratio of passing set\n",
    "# Which filter combination gives you the best Ti/Tv closest to 2.0?\n",
    "\n",
    "# Worked solution \u2014 edit any line and re-run to experiment.\n",
    "# We re-use apply_hard_filters() and compute_titv() from earlier cells.\n",
    "\n",
    "best_combo = None          # remember the combo with Ti/Tv closest to 2.0\n",
    "best_distance = None       # how far that combo's Ti/Tv is from 2.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sweep every (min_qual, min_dp) combo, count survivors, and track the best Ti/Tv.\n",
    "print(f\"{'min_qual':>10} | {'min_dp':>8} | {'n_pass':>8} | {'Ti/Tv':>8}\")\n",
    "print(\"-\" * 45)\n",
    "for min_qual in [20, 50, 100, 200]:\n",
    "    for min_dp in [10, 20, 30]:\n",
    "        # Keep the AF window the same as the default; vary QUAL and DP only.\n",
    "        kept, _ = apply_hard_filters(variants, min_qual=min_qual, min_dp=min_dp)\n",
    "        n_pass = len(kept)\n",
    "        # compute_titv returns (ti, tv, ratio); we just want the ratio here.\n",
    "        _, _, titv = compute_titv(kept)\n",
    "        print(f\"{min_qual:>10} | {min_dp:>8} | {n_pass:>8d} | {titv:>8.2f}\")\n",
    "\n",
    "        # Track which combo lands closest to the expected ~2.0 (only if some variants survive).\n",
    "        if n_pass > 0:\n",
    "            distance = abs(titv - 2.0)\n",
    "            if best_distance is None or distance < best_distance:\n",
    "                best_distance = distance\n",
    "                best_combo = (min_qual, min_dp, n_pass, titv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report the winning filter combination.\n",
    "print()\n",
    "mq, md, npass, ratio = best_combo\n",
    "print(f\"Best Ti/Tv closest to 2.0: min_qual={mq}, min_dp={md} \"\n",
    "      f\"-> {npass} variants, Ti/Tv={ratio:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Heterozygosity analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 2:\n",
    "# For the passing variants, classify each by genotype:\n",
    "#   0/1 = heterozygous, 1/1 = homozygous alt, 0/0 = homozygous ref\n",
    "# Then:\n",
    "#   (a) What fraction are heterozygous?\n",
    "#   (b) Plot the AF distribution separately for het (0/1) and hom-alt (1/1) calls\n",
    "#   (c) Does the AF make sense? (het should cluster near 0.5, hom near 0.9-1.0)\n",
    "\n",
    "# Worked solution \u2014 edit any line and re-run to experiment.\n",
    "# Each variant has one sample; its genotype lives in sample[\"GT\"].\n",
    "\n",
    "het_afs = []   # AFs for heterozygous calls (0/1)\n",
    "hom_afs = []   # AFs for homozygous alt calls (1/1)\n",
    "n_het = n_homalt = n_other = 0\n",
    "\n",
    "for v in passing:\n",
    "    sample = list(v.samples.values())[0]          # this VCF has a single sample\n",
    "    gt = sample.get(\"GT\", \"./.\").replace(\"|\", \"/\")  # treat phased the same as unphased\n",
    "    if gt == \"0/1\" or gt == \"1/0\":\n",
    "        het_afs.append(v.af)\n",
    "        n_het += 1\n",
    "    elif gt == \"1/1\":\n",
    "        hom_afs.append(v.af)\n",
    "        n_homalt += 1\n",
    "    else:\n",
    "        n_other += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (a) Report genotype counts and mean AF per class (het should sit near 0.5).\n",
    "frac_het = n_het / len(passing)\n",
    "print(f\"Heterozygous (0/1):     {n_het}\")\n",
    "print(f\"Homozygous alt (1/1):   {n_homalt}\")\n",
    "print(f\"Other genotypes:        {n_other}\")\n",
    "print(f\"Fraction heterozygous:  {frac_het:.1%}\")\n",
    "print()\n",
    "print(f\"Mean AF for het calls:      {np.mean(het_afs):.3f}  (expected near 0.5)\")\n",
    "print(f\"Mean AF for hom-alt calls:  {np.mean(hom_afs):.3f}  (expected near 0.9-1.0)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (b) Figure: AF distribution split by genotype class (one cell so it renders).\n",
    "fig, ax = plt.subplots(figsize=(8, 4))\n",
    "bins = np.linspace(0, 1, 21)\n",
    "ax.hist(het_afs, bins=bins, color=\"steelblue\", alpha=0.7,\n",
    "        edgecolor=\"white\", label=f\"Heterozygous (0/1), n={n_het}\")\n",
    "ax.hist(hom_afs, bins=bins, color=\"darkorange\", alpha=0.7,\n",
    "        edgecolor=\"white\", label=f\"Homozygous alt (1/1), n={n_homalt}\")\n",
    "ax.axvline(0.5, color=\"steelblue\", linestyle=\"--\", alpha=0.7)\n",
    "ax.axvline(1.0, color=\"darkorange\", linestyle=\"--\", alpha=0.7)\n",
    "ax.set_xlabel(\"Allele frequency (AF)\")\n",
    "ax.set_ylabel(\"Count\")\n",
    "ax.legend()\n",
    "plt.title(\"AF by Genotype Class\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: Variant density plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 3:\n",
    "# Create a variant density plot along chr22.\n",
    "# Divide the 17.0-17.5 Mb region (where our variants are) into\n",
    "# 10 bins of equal size.\n",
    "# For each bin, count:\n",
    "#   - Total variants\n",
    "#   - PASS variants only\n",
    "# Plot as a grouped bar chart.\n",
    "\n",
    "# Worked solution \u2014 edit any line and re-run to experiment.\n",
    "# We split positions 17.0-17.5 Mb into 10 equal bins and count variants in each.\n",
    "\n",
    "region_start = 17_000_000\n",
    "region_end   = 17_500_000\n",
    "n_bins       = 10\n",
    "bin_edges    = np.linspace(region_start, region_end, n_bins + 1)\n",
    "\n",
    "total_counts = np.zeros(n_bins, dtype=int)   # all variants per bin\n",
    "pass_counts  = np.zeros(n_bins, dtype=int)   # only FILTER=PASS variants per bin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop each variant into its bin, counting totals and PASS-only separately.\n",
    "for v in variants:\n",
    "    # np.searchsorted finds which bin the position falls into.\n",
    "    idx = np.searchsorted(bin_edges, v.pos, side=\"right\") - 1\n",
    "    if 0 <= idx < n_bins:\n",
    "        total_counts[idx] += 1\n",
    "        if v.is_pass:\n",
    "            pass_counts[idx] += 1\n",
    "\n",
    "# Build readable bin labels in megabases, e.g. \"17.00-17.05\".\n",
    "bin_labels = [f\"{bin_edges[i]/1e6:.2f}-{bin_edges[i+1]/1e6:.2f}\"\n",
    "              for i in range(n_bins)]\n",
    "\n",
    "print(\"Variants per 50 kb bin (Mb):\")\n",
    "for label, tot, pas in zip(bin_labels, total_counts, pass_counts):\n",
    "    print(f\"  {label} Mb : {tot:>2d} total, {pas:>2d} PASS\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Figure: grouped bar chart of total vs PASS counts per bin (one cell so it renders).\n",
    "fig, ax = plt.subplots(figsize=(10, 4))\n",
    "x = np.arange(n_bins)\n",
    "width = 0.4\n",
    "ax.bar(x - width/2, total_counts, width, color=\"steelblue\",\n",
    "       edgecolor=\"white\", label=\"All variants\")\n",
    "ax.bar(x + width/2, pass_counts, width, color=\"mediumseagreen\",\n",
    "       edgecolor=\"white\", label=\"PASS only\")\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels(bin_labels, rotation=45, ha=\"right\")\n",
    "ax.set_xlabel(\"Genomic region (Mb on chr22)\")\n",
    "ax.set_ylabel(\"Variant count\")\n",
    "ax.legend()\n",
    "plt.title(\"Variant Density Along chr22\")\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4: Functional annotation categories\n",
    "\n",
    "Real pipelines annotate each variant's effect with **SnpEff** or **ANNOVAR**\n",
    "(`snpEff GRCh38 sample.vcf.gz`), which classifies it as missense / synonymous /\n",
    "nonsense / splice / intergenic against a gene model. We don't have the annotator\n",
    "in-kernel, so this exercise *simulates* a realistic category mix to practice\n",
    "summarizing an annotated callset \u2014 on a real machine you would parse SnpEff's\n",
    "`ANN` INFO field instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Exercise 4:\n",
    "# In real variant analysis, each variant gets annotated as:\n",
    "#   synonymous, missense, nonsense, splice_site, intergenic, etc.\n",
    "# Simulate this: assign a random annotation category to each passing variant\n",
    "# using realistic frequencies (see COSMIC data):\n",
    "#   missense: 55%, synonymous: 30%, nonsense: 5%, splice: 5%, intergenic: 5%\n",
    "# Then:\n",
    "#   (a) Plot a pie chart of annotation categories\n",
    "#   (b) Calculate the mean QUAL for each category\n",
    "\n",
    "import random\n",
    "random.seed(99)\n",
    "categories = ['missense', 'synonymous', 'nonsense', 'splice_site', 'intergenic']\n",
    "weights    = [0.55,        0.30,         0.05,        0.05,         0.05]\n",
    "\n",
    "annotations = random.choices(categories, weights=weights, k=len(passing))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Worked solution \u2014 edit any line and re-run to experiment.\n",
    "# Count how many variants fell into each category, and collect their QUAL scores.\n",
    "category_counts = Counter(annotations)\n",
    "qual_by_category = {c: [] for c in categories}\n",
    "for variant, annotation in zip(passing, annotations):\n",
    "    qual_by_category[annotation].append(variant.qual)\n",
    "\n",
    "# (b) mean QUAL per category (skip empty categories to avoid dividing by zero).\n",
    "print(f\"{'Category':12s} | {'Count':>5} | {'Mean QUAL':>9}\")\n",
    "print(\"-\" * 34)\n",
    "mean_quals = {}\n",
    "for c in categories:\n",
    "    quals = qual_by_category[c]\n",
    "    mean_q = np.mean(quals) if quals else 0.0\n",
    "    mean_quals[c] = mean_q\n",
    "    print(f\"{c:12s} | {category_counts[c]:>5d} | {mean_q:>9.1f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (a) Figure: category pie + mean-QUAL bar chart (one cell so it renders).\n",
    "present = [c for c in categories if category_counts[c] > 0]\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "axes[0].pie([category_counts[c] for c in present],\n",
    "            labels=present, autopct=\"%1.0f%%\", startangle=90)\n",
    "axes[0].set_title(\"Annotation Categories (passing variants)\")\n",
    "\n",
    "axes[1].bar(present, [mean_quals[c] for c in present],\n",
    "            color=\"steelblue\", edgecolor=\"white\")\n",
    "axes[1].set_ylabel(\"Mean QUAL\")\n",
    "axes[1].set_title(\"Mean Quality by Annotation\")\n",
    "axes[1].tick_params(axis=\"x\", rotation=45)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "| Metric | Good value | What it means if wrong |\n",
    "|--------|-----------|------------------------|\n",
    "| Ti/Tv (WGS) | ~2.0 | <1.5 \u2192 many false positives; >3.0 \u2192 possible enrichment bias |\n",
    "| PASS rate | >70% | <50% \u2192 alignment or calibration problem |\n",
    "| Het fraction | 40-60% | >80% het \u2192 contamination; <20% \u2192 inbreeding or wrong ploidy |\n",
    "| Mean DP at variants | 20-60x (WGS) | <10x \u2192 poor calling sensitivity |\n",
    "\n",
    "**Key takeaways:**\n",
    "- VCF filtering is a balance between sensitivity and specificity\n",
    "- Ti/Tv ratio is the single best summary QC metric for SNV calling\n",
    "- Hard filters are fast; VQSR (Variant Quality Score Recalibration) is better for large cohorts\n",
    "- Always look at QUAL vs DP scatter \u2014 outliers reveal systematic artifacts\n",
    "\n",
    "**Next:** Module 6 \u2014 RNA-seq and DESeq2"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}