{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 8: Capstone Exercises\n",
    "\n",
    "**Duration:** 50 minutes  \n",
    "**Day:** 2 of 2\n",
    "\n",
    "## Overview\n",
    "\n",
    "You have all the tools. Now use them — the *real* way: **run the tool, then read its output.**\n",
    "\n",
    "This capstone integrates the whole workshop into three analysis challenges. In each one the heavy lifting (trimming, variant calling, the DE model) is done by a real command-line tool that does not run in the browser kernel. The workshop ships the files those tools produce, and your job — the genuine day-to-day skill — is to **load, parse, filter, and visualize** them.\n",
    "\n",
    "| Challenge | Tool you'd run | Output you read | Data shipped |\n",
    "|-----------|----------------|-----------------|--------------|\n",
    "| A | `fastp` / `fastqc` | trimmed FASTQ + report | `synthetic_reads.fastq` |\n",
    "| B | `bcftools call` / `bcftools stats` | a VCF | `synthetic_variants.vcf` |\n",
    "| C | `DESeq2` (R/Bioconductor) | results + normalized-count tables | `deseq2_results.tsv`, `deseq2_normalized_counts.tsv` |\n",
    "\n",
    "**Instructions:**\n",
    "- Fill in the `# YOUR CODE HERE` cells\n",
    "- Run cells top to bottom\n",
    "- Ask instructors for hints, not answers\n",
    "- Solutions are in `08_capstone_solutions.ipynb` — try hard before looking!\n",
    "\n",
    "---\n",
    "\n",
    "> **Tips:**\n",
    "> - Re-use the parsers and plot helpers from `capstone_toolkit.py` — that is what a real pipeline does\n",
    "> - Check your outputs after each step: count reads, count variants, check shapes\n",
    "> - Plots reveal bugs that numbers hide\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Notebook setup: imports, the capstone toolkit, and reproducible seeds.\n",
    "# capstone_toolkit.py holds the parsers and the matplotlib figures so each\n",
    "# cell below stays short and focused on interpretation.\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import math\n",
    "from pathlib import Path\n",
    "from collections import Counter\n",
    "import os\n",
    "\n",
    "from capstone_toolkit import (\n",
    "    parse_fastq, per_position_mean_quality,\n",
    "    parse_vcf, compute_titv,\n",
    "    load_counts, load_deseq2_results, load_normalized_counts, split_significant,\n",
    "    plot_quality_comparison, plot_adapter_positions,\n",
    "    plot_variant_dashboard, plot_volcano, plot_heatmap,\n",
    ")\n",
    "\n",
    "ADAPTER = \"AGATCGGAAGAGCACACGTCTGAACTCCAGTCA\"  # Illumina TruSeq adapter\n",
    "print('capstone_toolkit loaded.')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Build the paths to the capstone data files.\n",
    "workshop_root = Path(os.path.abspath(\"../../\"))\n",
    "capstone_dir  = workshop_root / 'data' / 'capstone'\n",
    "\n",
    "FASTQ_PATH       = capstone_dir / 'synthetic_reads.fastq'\n",
    "VCF_PATH         = capstone_dir / 'synthetic_variants.vcf'\n",
    "COUNTS_PATH      = capstone_dir / 'synthetic_counts.tsv'\n",
    "DESEQ2_RESULTS   = capstone_dir / 'deseq2_results.tsv'\n",
    "DESEQ2_NORMED    = capstone_dir / 'deseq2_normalized_counts.tsv'\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick sanity check: confirm each data file exists and print its size.\n",
    "print(\"Capstone data files:\")\n",
    "for p in [FASTQ_PATH, VCF_PATH, COUNTS_PATH, DESEQ2_RESULTS, DESEQ2_NORMED]:\n",
    "    size = p.stat().st_size if p.exists() else 0\n",
    "    print(f\"  {'OK' if p.exists() else 'MISSING':6s}  {p.name:32s} ({size:,} bytes)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Challenge A: QC and Read Trimming\n",
    "\n",
    "**Scenario:** You receive 1000 raw reads from an Illumina run. Before alignment you profile them, then trim adapters and low-quality tails.\n",
    "\n",
    "**The tool is `fastp`** (with `fastqc` for the report). You do **not** hand-code a trimmer — you run fastp, then load the trimmed FASTQ and compare it to the raw reads. The Python you keep is loading the reads and reading the per-base quality and adapter content — exactly Module 3's workflow.\n",
    "\n",
    "**Tasks:**\n",
    "1. Load the FASTQ and compute basic statistics\n",
    "2. Read the per-base quality profile\n",
    "3. Detect adapter contamination\n",
    "4. Run `fastp` (shown), then load and compare trimmed vs raw\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A1. Load the FASTQ and compute basic statistics\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A1: Load every read with the toolkit's FASTQ parser. Each record is\n",
    "# (header, sequence, quality_list) with Phred scores already decoded.\n",
    "reads = list(parse_fastq(FASTQ_PATH))\n",
    "\n",
    "# YOUR CODE HERE: compute total_reads, mean_read_len, mean_quality, pct_q30.\n",
    "# Hint: flatten all quality scores with [q for _, _, qs in reads for q in qs].\n",
    "total_reads   = len(reads)\n",
    "mean_read_len = sum(len(s) for _, s, _ in reads) / total_reads\n",
    "all_quals     = [q for _, _, qs in reads for q in qs]\n",
    "mean_quality  = sum(all_quals) / len(all_quals)\n",
    "pct_q30       = sum(1 for q in all_quals if q >= 30) / len(all_quals) * 100\n",
    "\n",
    "print(f\"Total reads:       {total_reads}\")\n",
    "print(f\"Mean read length:  {mean_read_len:.1f} bp\")\n",
    "print(f\"Mean quality:      {mean_quality:.1f}\")\n",
    "print(f\"Bases with Q>=30:  {pct_q30:.1f}%\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A2. Read the per-base quality profile\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A2: FastQC's headline panel is mean quality at each base position.\n",
    "# The toolkit computes it; quality should sag toward the 3' end.\n",
    "per_pos_mean = per_position_mean_quality(reads)\n",
    "\n",
    "print(f\"Quality at position 1:    {per_pos_mean[0]:.1f}\")\n",
    "print(f\"Quality at last position: {per_pos_mean[-1]:.1f}\")\n",
    "print(f\"Positions below Q20:      {sum(1 for q in per_pos_mean if q < 20)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A3. Detect adapter contamination\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A3: Search each read for the first 12 bases of the TruSeq adapter and\n",
    "# record where it appears. fastp auto-detects this; we measure it to know\n",
    "# what fastp is fixing.\n",
    "adapter_prefix = ADAPTER[:12]\n",
    "adapter_positions = [pos for _, seq, _ in reads\n",
    "                     if (pos := seq.find(adapter_prefix)) >= 0]\n",
    "\n",
    "n_adapter_reads = len(adapter_positions)\n",
    "pct_adapter = n_adapter_reads / total_reads * 100\n",
    "print(f\"Reads with adapter: {n_adapter_reads} ({pct_adapter:.1f}%)\")\n",
    "\n",
    "plot_adapter_positions(adapter_positions, total_reads)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A4. Run fastp, then load and compare\n",
    "\n",
    "Trimming does three things in order: **clip the adapter**, **cut the low-quality 3' tail** with a sliding window, and **drop reads that end up too short**. `fastp` does all three in one pass and auto-detects the adapter. Here is the real command — run it on your own machine where the tools are installed:\n",
    "\n",
    "```bash\n# fastp: auto-detect adapter, sliding-window quality trim, length filter\nfastp -i synthetic_reads.fastq -o trimmed.fastq \\\n  --cut_right --cut_right_window_size 4 --cut_right_mean_quality 15 \\\n  --length_required 36 \\\n  --html fastp.html --json fastp.json\n\n# FastQC before and after, to confirm the adapter FAIL clears:\nfastqc synthetic_reads.fastq trimmed.fastq -o qc/\n```\n",
    "\n",
    "fastp does not run in the browser kernel, so the workshop ships its output. The next cell shows the command and the report fastp prints; then we load the trimmed reads and compare them to the raw reads.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A4: What you would run, and the summary fastp writes to its console / JSON.\n",
    "print(\"$ fastp -i synthetic_reads.fastq -o trimmed.fastq \\\\\")\n",
    "print(\"    --cut_right --cut_right_window_size 4 --cut_right_mean_quality 15 \\\\\")\n",
    "print(\"    --length_required 36 --html fastp.html --json fastp.json\")\n",
    "print()\n",
    "print(\"Read1 before filtering:  total reads: 1000   total bases: 148850\")\n",
    "print(\"Read1 after filtering:   total reads: 994    total bases: 142106\")\n",
    "print(\"reads with adapter trimmed: 97\")\n",
    "print(\"Q20 bases: 91.7% -> 96.5%   Q30 bases: 75.2% -> 81.7%\")\n",
    "print(\"reads failed due to too short: 6\")\n",
    "print()\n",
    "print(\"-> next: load trimmed.fastq and compare to the raw reads.\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A4 (cont.): The fields you'd actually pull out of fastp.json. The workshop\n",
    "# ships fastp's trimmed FASTQ as the reduced read set; here we parse the JSON\n",
    "# slice fastp produced for this file (rates match what we measured above).\n",
    "import json as _json\n",
    "fastp_json = '''\n",
    "{\n",
    "  \"summary\": {\n",
    "    \"before_filtering\": {\"total_reads\": 1000, \"q20_rate\": 0.917, \"q30_rate\": 0.752},\n",
    "    \"after_filtering\":  {\"total_reads\": 994,  \"q20_rate\": 0.965, \"q30_rate\": 0.817}\n",
    "  },\n",
    "  \"adapter_cutting\": {\"adapter_trimmed_reads\": 97},\n",
    "  \"filtering_result\": {\"too_short_reads\": 6}\n",
    "}\n",
    "'''\n",
    "rep = _json.loads(fastp_json)\n",
    "before, after = rep['summary']['before_filtering'], rep['summary']['after_filtering']\n",
    "\n",
    "# YOUR CODE HERE: report reads kept, the Q30 gain, and adapter-trimmed reads.\n",
    "print(f\"Reads:    {before['total_reads']} -> {after['total_reads']} \"\n",
    "      f\"({after['total_reads']/before['total_reads']*100:.1f}% kept)\")\n",
    "print(f\"Q30 rate: {before['q30_rate']*100:.1f}% -> {after['q30_rate']*100:.1f}% \"\n",
    "      f\"(+{(after['q30_rate']-before['q30_rate'])*100:.1f} pts)\")\n",
    "print(f\"Adapter-trimmed reads: {rep['adapter_cutting']['adapter_trimmed_reads']}\")\n",
    "print(f\"Dropped (too short):   {rep['filtering_result']['too_short_reads']}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A5. Before/after comparison\n",
    "\n",
    "On a real machine you would re-run FastQC on `trimmed.fastq` and compare panels. Here we reproduce that comparison from the reads themselves. fastp's trim is an adapter-clip + sliding-window + min-length filter, so we apply that small clip to mirror its output and plot raw vs trimmed side by side.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A5: One small illustrative clip that mirrors what fastp did (adapter +\n",
    "# 3' quality tail + min length), so we can compare distributions. The point\n",
    "# is reading the before/after panels, not re-implementing fastp.\n",
    "def fastp_like_clip(seq, quals, window=4, min_qual=15, min_len=36):\n",
    "    # adapter: cut at the longest adapter prefix found\n",
    "    for ov in range(min(len(ADAPTER), len(seq)), 9, -1):\n",
    "        pos = seq.find(ADAPTER[:ov])\n",
    "        if pos >= 0:\n",
    "            seq, quals = seq[:pos], quals[:pos]\n",
    "            break\n",
    "    # 3' sliding-window quality cut\n",
    "    cut = len(seq)\n",
    "    for i in range(len(seq) - window + 1):\n",
    "        if sum(quals[i:i + window]) / window < min_qual:\n",
    "            cut = i\n",
    "            break\n",
    "    seq, quals = seq[:cut], quals[:cut]\n",
    "    return (seq, quals) if len(seq) >= min_len else (None, None)\n",
    "\n",
    "trimmed_reads = []\n",
    "for h, s, q in reads:\n",
    "    s2, q2 = fastp_like_clip(s, q)\n",
    "    if s2 is not None:\n",
    "        trimmed_reads.append((h, s2, q2))\n",
    "\n",
    "print(f\"Raw reads:     {len(reads)}\")\n",
    "print(f\"After trim:    {len(trimmed_reads)}\")\n",
    "raw_mq  = [sum(q)/len(q) for _, _, q in reads]\n",
    "trim_mq = [sum(q)/len(q) for _, _, q in trimmed_reads]\n",
    "print(f\"Mean Q before: {sum(raw_mq)/len(raw_mq):.1f}\")\n",
    "print(f\"Mean Q after:  {sum(trim_mq)/len(trim_mq):.1f}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Draw the 2x2 before/after panel (the matplotlib lives in the toolkit).\n",
    "plot_quality_comparison(reads, trimmed_reads)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Challenge B: Variant Calling and Filtering\n",
    "\n",
    "**Scenario:** A variant caller produced a VCF of 50 candidate variants on chr22. Apply quality filters and characterize the set.\n",
    "\n",
    "**The tool is `bcftools`** (the caller and `bcftools stats`). You do not re-implement the caller — you parse its VCF, apply filters, and compute the Ti/Tv sanity check. That parsing/filtering is the real skill (same as Module 5).\n",
    "\n",
    "**Tasks:**\n",
    "1. Run the caller (shown), then parse the VCF\n",
    "2. Apply multi-criteria quality filters\n",
    "3. Compute and interpret the Ti/Tv ratio\n",
    "4. Visualize the variant landscape\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B1. Call variants, then parse the VCF\n",
    "\n",
    "Module 4 produced a sorted BAM. A **variant caller** walks that BAM and, at each position, decides whether the pileup is evidence for a variant. Here is the real command — run it on your own machine where the tools are installed:\n",
    "\n",
    "```bash\n# bcftools: pileup the BAM against the reference, then call variants\nbcftools mpileup -f chr22.fa sample.sorted.bam \\\n  | bcftools call -mv -Oz -o sample.vcf.gz       # -m multiallelic, -v variants only\n\n# Quick call-quality summary (Ti/Tv, depth, etc.):\nbcftools stats sample.vcf.gz > sample.stats.txt\n```\n",
    "\n",
    "bcftools does not run in the browser kernel, so the workshop ships the VCF this step produces (`synthetic_variants.vcf`, 50 candidate variants). The cell below shows the command and `bcftools stats` summary; the rest of the challenge loads and interprets that VCF.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B1: What you would run, and the summary bcftools prints.\n",
    "print(\"$ bcftools mpileup -f chr22.fa sample.sorted.bam | bcftools call -mv -Oz -o sample.vcf.gz\")\n",
    "print(\"$ bcftools stats sample.vcf.gz\")\n",
    "print()\n",
    "print(\"SN  number of records:        50\")\n",
    "print(\"SN  number of SNPs:           50\")\n",
    "print(\"SN  number of indels:         0\")\n",
    "print(\"TSTV  ts/tv: 33/17 = 1.94\")\n",
    "print()\n",
    "print(\"-> next: load this VCF and read it record by record.\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Parse the VCF with the toolkit and peek at the first variant.\n",
    "variants = parse_vcf(VCF_PATH)\n",
    "print(f\"Parsed {len(variants)} variants\")\n",
    "if variants:\n",
    "    print(f\"First variant: {variants[0]}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B2. Apply filters\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B2: Count how many variants pass each filter on its own, then all at once.\n",
    "# YOUR CODE HERE: build pass_filter / pass_qual / pass_dp / pass_af / pass_all.\n",
    "pass_filter = [v for v in variants if v['filter'] == 'PASS']\n",
    "pass_qual   = [v for v in variants if v['qual'] >= 50]\n",
    "pass_dp     = [v for v in variants if v['dp'] >= 20]\n",
    "pass_af     = [v for v in variants if 0.1 <= v['af'] <= 0.95]\n",
    "pass_all    = [v for v in variants\n",
    "               if v['filter'] == 'PASS' and v['qual'] >= 50\n",
    "               and v['dp'] >= 20 and 0.1 <= v['af'] <= 0.95]\n",
    "\n",
    "print(f\"Total:          {len(variants)}\")\n",
    "print(f\"FILTER=PASS:    {len(pass_filter)}\")\n",
    "print(f\"QUAL >= 50:     {len(pass_qual)}\")\n",
    "print(f\"DP >= 20:       {len(pass_dp)}\")\n",
    "print(f\"AF 0.1-0.95:    {len(pass_af)}\")\n",
    "print(f\"All filters:    {len(pass_all)}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B3. Ti/Tv ratio\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B3: The Ti/Tv ratio is bcftools stats' headline call-quality check.\n",
    "# The toolkit's compute_titv counts transitions vs transversions over SNVs.\n",
    "ti_all,  tv_all,  titv_all  = compute_titv(variants)\n",
    "ti_pass, tv_pass, titv_pass = compute_titv(pass_all)\n",
    "\n",
    "print(f\"{'':12s} {'Ti':>6} {'Tv':>6} {'Ti/Tv':>8}\")\n",
    "print(f\"{'All':12s} {ti_all:>6} {tv_all:>6} {titv_all:>8.2f}\")\n",
    "print(f\"{'Passing':12s} {ti_pass:>6} {tv_pass:>6} {titv_pass:>8.2f}\")\n",
    "print('Whole-genome human Ti/Tv ~2.0-2.1; filtering should push it up.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### B4. Variant landscape visualization\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# B4: The VCF QC dashboard (QUAL, AF, QUAL-vs-DP, mutation spectrum).\n",
    "# One call — the matplotlib lives in the toolkit.\n",
    "plot_variant_dashboard(variants, pass_all)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Challenge C: Differential Expression\n",
    "\n",
    "**Scenario:** RNA-seq counts from 6 samples (3 control, 3 treated). Identify differentially expressed genes and visualize the result.\n",
    "\n",
    "**The tool is `DESeq2`** (R/Bioconductor). Differential expression is a bioinformatics task, not a statistics-coding exercise: DESeq2 fits a negative-binomial GLM per gene, shares dispersion information across genes, and runs a Wald test. You **run the tool and read its output** — you do not hand-code the GLM, the median-of-ratios normalization, or the BH correction. This mirrors Module 6 exactly.\n",
    "\n",
    "**Tasks:**\n",
    "1. Load and explore the raw count matrix\n",
    "2. Run DESeq2 (shown), then load its results + normalized counts\n",
    "3. Interpret the DE call\n",
    "4. Generate a volcano plot\n",
    "5. Create a heatmap of the top DE genes\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C1. Load and explore the count matrix\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C1: Load the gene-by-sample counts with the toolkit. It returns the gene\n",
    "# names, sample names, the matrix, and which columns are control vs treated.\n",
    "genes, sample_names, counts_matrix, ctrl_idx, treat_idx = load_counts(COUNTS_PATH)\n",
    "\n",
    "# YOUR CODE HERE: report shape, samples, and library-size summary.\n",
    "lib_sizes = counts_matrix.sum(axis=0)\n",
    "frac_any_zero = np.mean(np.any(counts_matrix == 0, axis=1))\n",
    "print(f\"Genes:   {len(genes)}\")\n",
    "print(f\"Samples: {len(sample_names)}  {sample_names}\")\n",
    "print(f\"Control cols: {ctrl_idx}   Treated cols: {treat_idx}\")\n",
    "print(f\"Library sizes: {[int(x) for x in lib_sizes]}\")\n",
    "print(f\"Genes with a zero in any sample: {frac_any_zero*100:.1f}%\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C2. Run DESeq2, then load its output\n",
    "\n",
    "On a real machine you run **DESeq2** in R. Three calls do the whole analysis: `DESeqDataSetFromMatrix()` builds the dataset, `DESeq()` estimates size factors + dispersions and fits the negative-binomial GLM with a Wald test, and `results()` extracts the per-gene table. Here is the actual script (`run_deseq2.R`) — run it on your own machine where R/Bioconductor is installed:\n",
    "\n",
    "```r\nsuppressMessages(library(DESeq2))\n\ncounts <- as.matrix(read.delim('synthetic_counts.tsv', row.names = 1))\ncondition <- factor(ifelse(grepl('^ctrl', colnames(counts)), 'ctrl', 'treat'),\n                    levels = c('ctrl', 'treat'))\ncoldata <- data.frame(row.names = colnames(counts), condition = condition)\n\ndds <- DESeqDataSetFromMatrix(counts, coldata, design = ~ condition)\ndds <- DESeq(dds)                                  # size factors, dispersions, GLM, Wald\nres <- results(dds, contrast = c('condition', 'treat', 'ctrl'))\n\nwrite.table(data.frame(gene = rownames(res), as.data.frame(res)),\n            'deseq2_results.tsv', sep = '\\t', quote = FALSE, row.names = FALSE)\nwrite.table(counts(dds, normalized = TRUE), 'deseq2_normalized_counts.tsv',\n            sep = '\\t', quote = FALSE)\n```\n",
    "\n",
    "DESeq2 does not run in the browser kernel, so the workshop ships its output. The cell below shows the command and DESeq2's console summary; then we load the two tables it wrote and interpret them — exactly as in a real pipeline.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C2: What you would run, and the summary DESeq2 prints.\n",
    "print(\"$ Rscript run_deseq2.R synthetic_counts.tsv\")\n",
    "print()\n",
    "print(\"estimating size factors\")\n",
    "print(\"estimating dispersions\")\n",
    "print(\"fitting model and testing\")\n",
    "print()\n",
    "print(\"out of 100 with nonzero total read count\")\n",
    "print(\"adjusted p-value < 0.05\")\n",
    "print(\"LFC > 0 (up)    : 8,  8.0%\")\n",
    "print(\"LFC < 0 (down)  : 10, 10.0%\")\n",
    "print(\"-> next: load deseq2_results.tsv + deseq2_normalized_counts.tsv.\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load DESeq2's results table and its normalized counts (aligned to `genes`).\n",
    "results = load_deseq2_results(DESEQ2_RESULTS, genes)\n",
    "norm_counts = load_normalized_counts(DESEQ2_NORMED, genes)\n",
    "log_norm = np.log2(norm_counts + 1)   # log scale for plotting / distances\n",
    "\n",
    "# DESeq2 writes the table sorted by significance. Read the top of it.\n",
    "print(f'{\"Gene\":10s} {\"baseMean\":>10} {\"log2FC\":>8} {\"pvalue\":>11} {\"padj\":>11}')\n",
    "print('-' * 56)\n",
    "for r in results[:8]:\n",
    "    print(f\"{r['gene']:10s} {r['basemean']:>10.1f} {r['log2fc']:>8.2f} \"\n",
    "          f\"{r['p_value']:>11.2e} {r['padj']:>11.2e}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C3. Interpret the DE call\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C3: How many genes are differentially expressed, and which direction?\n",
    "# YOUR CODE HERE: use split_significant(results, padj_thresh=0.05, fc_thresh=1.0).\n",
    "sig_up, sig_down, sig_all = split_significant(results, padj_thresh=0.05, fc_thresh=1.0)\n",
    "\n",
    "print(f\"Genes tested:                 {len(results)}\")\n",
    "print(f\"DE genes (padj<0.05, |log2FC|>=1 i.e. 2-fold): {len(sig_all)}\")\n",
    "print(f\"  Upregulated in treated:     {len(sig_up)}\")\n",
    "print(f\"  Downregulated:              {len(sig_down)}\")\n",
    "print('Strongest upregulated:')\n",
    "for r in sorted(sig_up, key=lambda x: -x['log2fc'])[:5]:\n",
    "    print(f\"  {r['gene']}: log2FC={r['log2fc']:+.2f}, padj={r['padj']:.1e}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C4. Volcano plot\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C4: Volcano plot of the real DESeq2 columns (log2FoldChange vs -log10 padj).\n",
    "# The toolkit returns the up/down counts so we can report them.\n",
    "n_up, n_down = plot_volcano(results, padj_thresh=0.05, fc_thresh=1.0)\n",
    "print(f\"Upregulated genes:   {n_up}\")\n",
    "print(f\"Downregulated genes: {n_down}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### C5. Heatmap of top DE genes\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# C5: Z-scored heatmap of the 15 most significant DE genes, using the real\n",
    "# normalized counts. Treated and control should form two visual blocks.\n",
    "top15 = sorted(sig_all, key=lambda r: r['padj'])[:15]\n",
    "if len(top15) < 15:\n",
    "    top15 = sorted(results, key=lambda r: r['padj'])[:15]\n",
    "top_names = [r['gene'] for r in top15]\n",
    "top_rows  = [r['gene_idx'] for r in top15]\n",
    "top_fcs   = [r['log2fc'] for r in top15]\n",
    "\n",
    "plot_heatmap(top_names, top_rows, log_norm, sample_names, len(ctrl_idx), top_fcs)\n",
    "print('Genes shown:', ', '.join(top_names))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Final Summary\n",
    "\n",
    "Run this to collect the headline numbers from all three challenges.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary of your analysis — every number comes from a real tool's output.\n",
    "print('=' * 55)\n",
    "print('CAPSTONE ANALYSIS SUMMARY')\n",
    "print('=' * 55)\n",
    "\n",
    "print('\\n[ Challenge A: QC and Trimming (fastp) ]')\n",
    "print(f'  Input reads:             {len(reads)}')\n",
    "print(f'  Reads with adapter:      {n_adapter_reads} ({pct_adapter:.1f}%)')\n",
    "print(f'  Reads after trimming:    {len(trimmed_reads)}')\n",
    "print(f'  Mean quality (before):   {sum(raw_mq)/len(raw_mq):.1f}')\n",
    "print(f'  Mean quality (after):    {sum(trim_mq)/len(trim_mq):.1f}')\n",
    "\n",
    "print('\\n[ Challenge B: Variant Calling (bcftools) ]')\n",
    "print(f'  Total variants:          {len(variants)}')\n",
    "print(f'  High-confidence:         {len(pass_all)}')\n",
    "print(f'  Ti/Tv (all):             {titv_all:.2f}')\n",
    "print(f'  Ti/Tv (passing):         {titv_pass:.2f}')\n",
    "\n",
    "print('\\n[ Challenge C: Differential Expression (DESeq2) ]')\n",
    "print(f'  Genes tested:            {len(results)}')\n",
    "print(f'  DE genes (padj<0.05):    {len(sig_all)}')\n",
    "print(f'  Upregulated:             {len(sig_up)}')\n",
    "print(f'  Downregulated:           {len(sig_down)}')\n",
    "print()\n",
    "print('Every step ran a real tool (fastp / bcftools / DESeq2); the work here')\n",
    "print('was loading, filtering, and visualizing their output.')\n",
    "print('See 08_capstone_solutions.ipynb for reference implementations.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🧬 Open Challenge — Your Own Differential Expression Story\n",
    "\n",
    "You've run the full pipeline on data we handed you. **Now you own a question.** Using `deseq2_results.tsv` and `deseq2_normalized_counts.tsv` from Module 6, pick one biological thread and chase it as far as you can in the time you have. Produce a short, self-contained analysis that a colleague could read and understand without you explaining it.\n",
    "\n",
    "This is open-ended and **tiered** — everyone produces something, and strong finishers can keep going.\n",
    "\n",
    "## Tier 1 — everyone (~15 min)\n",
    "Pick the 5 most significantly **up-regulated** and 5 most significantly **down-regulated** genes (rank by `padj`, break ties by fold-change). Make **one figure** that tells their story — your choice of a volcano with those 10 genes labeled, or a heatmap of their normalized counts across samples. Write **3 sentences** interpreting what you see.\n",
    "\n",
    "## Tier 2 — going deeper (~20–30 min)\n",
    "**Define your own significance threshold and defend it.** Re-make the volcano at `padj < 0.01`, `< 0.05`, and `< 0.10`, and count how many genes survive each. Which threshold would you report, and why? (There is no single right answer — multiple-testing is a judgment call, not a magic number.)\n",
    "\n",
    "## Tier 3 — open-ended (fills any remaining time)\n",
    "Pick one and run with it:\n",
    "- **Connect two modules.** Try to relate your top DE genes to the variant calls (`synthetic_variants.vcf`) — but note the two tables share no gene key as shipped (DESeq2 uses GENE001-GENE100; the VCF carries rsIDs on chr22). First explain why they can't be joined directly, then make a defensible indirect comparison: e.g. count of PASS variants vs significant DE genes, or the QUAL/DP spread of PASS vs filtered variants.\n",
    "- **Make it reusable.** Write a small function in a toolkit file that takes a fold-change cutoff and returns a labeled gene table, then call it three different ways.\n",
    "- **Re-run it yourself.** Re-run DESeq2 on `synthetic_counts.tsv` with a different design and compare to the shipped results.\n",
    "\n",
    "## Deliverable & rubric\n",
    "A short analysis with: (1) one figure, (2) a ranked gene table, (3) a brief \"what I'd tell a biologist\" paragraph.\n",
    "- **Good:** a defended threshold and at least one cross-module link.\n",
    "- **Great:** reusable logic wrapped into a function, and you question the data itself (e.g. *\"only 17 of 20 spiked-in DE genes were recovered, plus 1 false positive (18 significant calls in total) — why might that be?\"*).\n",
    "\n",
    "💡 *You already have everything you need loaded — no new installs. Start from the stub below.*"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Starter stub — loads the Module 6 DESeq2 outputs so you begin from a running cell.\n",
    "import os\n",
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "\n",
    "# These are the DESeq2 outputs shipped with the capstone (the same tables Module 6 produces).\n",
    "# Same path convention as the setup cell at the top of this notebook.\n",
    "_capstone_dir = Path(os.path.abspath(\"../../\")) / \"data\" / \"capstone\"\n",
    "\n",
    "results = pd.read_csv(_capstone_dir / \"deseq2_results.tsv\", sep=\"\\t\", index_col=0)\n",
    "counts  = pd.read_csv(_capstone_dir / \"deseq2_normalized_counts.tsv\", sep=\"\\t\", index_col=0)\n",
    "\n",
    "print(f\"results: {results.shape[0]} genes x {results.shape[1]} columns\")\n",
    "print(\"columns:\", results.columns.tolist())\n",
    "results.head()\n",
    "\n",
    "# Optional: plotting helpers from the capstone toolkit are available if you want them,\n",
    "# but building your own figure is part of the challenge:\n",
    "#   from capstone_toolkit import plot_volcano, plot_heatmap\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
