{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 3: Quality Control and Read Trimming\n",
    "\n",
    "**Duration:** 60 minutes\n",
    "**Day:** 1 of 2\n",
    "\n",
    "## Learning Objectives\n",
    "\n",
    "By the end of this module you will be able to:\n",
    "- Profile raw reads with **FastQC** and read its per-base quality, GC and adapter panels\n",
    "- Trim adapters and low-quality tails with **fastp** / **Trimmomatic**, and choose the parameters\n",
    "- Aggregate many samples into one report with **MultiQC** and spot the outlier lane\n",
    "- Parse a tool's report (FastQC summary, fastp JSON) with a few lines of Python\n",
    "- Decide *when to trim and when trimming hurts* (over-trimming)\n",
    "\n",
    "---\n",
    "\n",
    "> **Where we left off.** In Module 2 you parsed **FASTQ** and decoded **Phred**\n",
    "> scores per read. QC is the gate every dataset passes through next: *is this data\n",
    "> good enough, and if not, how do we fix it?* The Phred encoding (`Q = ord(char) - 33`)\n",
    "> carries straight over."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **How this module works - it's a QC *workflow*, not a Python course.** On a real\n",
    "> machine QC is a short pipeline of command-line tools:\n",
    ">\n",
    "> ```\n",
    "> fastqc  ->  fastp / Trimmomatic  ->  fastqc again  ->  multiqc\n",
    "> profile      trim & clean            re-profile        aggregate\n",
    "> ```\n",
    ">\n",
    "> Those are compiled tools - they don't run in this browser kernel - so for each step\n",
    "> you'll see **the exact command** and **the report it produces**, and your job is to\n",
    "> *read and interpret* it. The Python we keep is the part you genuinely do every day:\n",
    "> **load the real reads** and **parse the tools' reports**. The metric/plot helpers live\n",
    "> in **`qc_toolkit.py`** (download from the module page) so each cell stays one or two lines."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import json, os\n",
    "from pathlib import Path\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# qc_toolkit.py holds the metric + plotting helpers (download it from the page).\n",
    "from qc_toolkit import (\n",
    "    ADAPTER, compute_per_position_stats, trim_adapter, trim_reads,\n",
    "    gc_content, adapter_content, plot_per_base_quality, plot_gc_and_adapter, plot_before_after,\n",
    ")\n",
    "print(\"QC toolkit loaded.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 1: The data - load real reads\n",
    "\n",
    "We work on a **real FASTQ file** shipped with the workshop (`synthetic_reads.fastq`).\n",
    "The one piece of Python you always write yourself is the loader - parsing FASTQ's\n",
    "4-lines-per-read layout:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# A minimal FASTQ reader: records are groups of 4 lines.\n",
    "def parse_fastq(filepath):\n",
    "    \"\"\"Yield (header, sequence, quality_scores) for each read.\"\"\"\n",
    "    with open(filepath) as f:\n",
    "        while True:\n",
    "            header = f.readline().rstrip()\n",
    "            seq    = f.readline().rstrip()\n",
    "            _plus  = f.readline()\n",
    "            qual   = f.readline().rstrip()\n",
    "            if not header:\n",
    "                break\n",
    "            yield header[1:], seq, [ord(c) - 33 for c in qual]\n",
    "\n",
    "data_dir = Path(os.path.abspath(\"../../\")) / \"data\" / \"capstone\"\n",
    "reads = list(parse_fastq(data_dir / \"synthetic_reads.fastq\"))\n",
    "\n",
    "lens = [len(s) for _, s, _ in reads]\n",
    "full_len = max(lens)\n",
    "n_short = sum(1 for L in lens if L < full_len)\n",
    "print(f\"Loaded {len(reads)} reads\")\n",
    "print(f\"Read length: {min(lens)}-{full_len} bp (mean {sum(lens)/len(lens):.0f})\")\n",
    "print(f\"Full-length ({full_len} bp) reads: {len(reads) - n_short}\")\n",
    "print(f\"Shorter reads (insert < read length): {n_short} ({n_short/len(reads)*100:.1f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 2: Profile with FastQC\n",
    "\n",
    "**FastQC** is the first thing you run on any new FASTQ. One command writes an HTML report\n",
    "plus a zip of the raw metrics:\n",
    "\n",
    "```bash\n",
    "fastqc synthetic_reads.fastq -o qc/\n",
    "# open qc/synthetic_reads_fastqc.html in a browser\n",
    "```\n",
    "\n",
    "It scores ~12 modules PASS / WARN / FAIL. A `summary.txt` for a file like ours reads:\n",
    "\n",
    "```text\n",
    "PASS   Basic Statistics\n",
    "WARN   Per base sequence quality\n",
    "PASS   Per sequence quality scores\n",
    "PASS   Per sequence GC content\n",
    "WARN   Sequence Length Distribution\n",
    "WARN   Sequence Duplication Levels\n",
    "FAIL   Adapter Content\n",
    "\n",
    "Basic Statistics\n",
    "    Total Sequences        1000\n",
    "    Sequence length        113-150\n",
    "    %GC                     50\n",
    "```\n",
    "\n",
    "**How to read it:** *Per base sequence quality* is **WARN** because quality sags toward\n",
    "the 3' end; *Sequence Length Distribution* is **WARN** because reads vary in length; and\n",
    "*Adapter Content* is **FAIL** because a chunk of reads run into the adapter. Those are the\n",
    "flags trimming addresses. Let's reproduce FastQC's headline panels from the same reads so\n",
    "the report stops being a black box."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# FastQC's per-base quality panel, computed on our real reads.\n",
    "pp = compute_per_position_stats(reads)\n",
    "print(f\"Mean Q at position    1: {pp[0]['mean']:.1f}\")\n",
    "print(f\"Mean Q at position   75: {pp[74]['mean']:.1f}\")\n",
    "print(f\"Mean Q at last position: {pp[-1]['mean']:.1f}\")\n",
    "\n",
    "plot_per_base_quality(reads, title=\"Per-Base Sequence Quality (raw reads)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same plot, live - drag the 3' decay slider and watch how many bases fall below the\n",
    "Q20 / Q30 lines:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:quality-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "FastQC's other two headline panels are **GC content** (should match the organism - ~41%\n",
    "for human) and **adapter content** (the % of reads carrying adapter at each position).\n",
    "The adapter spike is this file's real problem:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "plot_gc_and_adapter(reads, title_suffix=\" (raw)\")\n",
    "\n",
    "stub = \"AGATCGGAAGAG\"\n",
    "n_adapter = sum(1 for _, s, _ in reads if stub in s)\n",
    "print(f\"Reads carrying the adapter stub: {n_adapter} ({n_adapter/len(reads)*100:.1f}%)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adapter contamination happens when the DNA fragment is shorter than the read, so the\n",
    "sequencer reads past the insert into the adapter. Shrink the insert below and watch the\n",
    "3' adapter content climb:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:adapter-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 3: Trim with fastp / Trimmomatic\n",
    "\n",
    "Trimming does three things, in order: **clip the adapter**, **cut low-quality 3' tails**\n",
    "with a sliding window, and **drop reads that end up too short**. Two tools dominate.\n",
    "\n",
    "**Trimmomatic** - the classic, each step spelled out:\n",
    "\n",
    "```bash\n",
    "trimmomatic SE -threads 4 synthetic_reads.fastq trimmed.fastq \\\n",
    "  ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 \\\n",
    "  SLIDINGWINDOW:4:20 \\\n",
    "  MINLEN:36\n",
    "```\n",
    "\n",
    "**fastp** - the modern all-in-one: *auto-detects* the adapter, trims, filters, and writes\n",
    "its own QC report in a single fast pass:\n",
    "\n",
    "```bash\n",
    "fastp -i synthetic_reads.fastq -o trimmed.fastq \\\n",
    "  --cut_right --cut_right_window_size 4 --cut_right_mean_quality 15 \\\n",
    "  --length_required 36 \\\n",
    "  --html fastp.html --json fastp.json\n",
    "```\n",
    "\n",
    "`SLIDINGWINDOW:4:20` and `--cut_right ... 15` are the **same idea** - slide a 4 bp window,\n",
    "cut when its mean quality drops - with different default cut-offs. Turn that knob here:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:trim-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The trim, applied to our reads\n",
    "\n",
    "`trim_reads` composes the three steps (adapter-clip -> sliding window -> length filter).\n",
    "The biggest win on this file is **removing the adapter** - watch the adapter-content panel\n",
    "go flat after trimming:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "trimmed, n_discarded = trim_reads(reads, window=4, min_qual=20, min_len=36)\n",
    "print(f\"Kept {len(trimmed)}/{len(reads)} reads; discarded {n_discarded} as too short.\")\n",
    "\n",
    "# Adapter content after trimming - the FAIL from FastQC should clear.\n",
    "plot_gc_and_adapter(trimmed, title_suffix=\" (after trim)\")\n",
    "stub = \"AGATCGGAAGAG\"\n",
    "print(f\"Reads still carrying adapter: \"\n",
    "      f\"{sum(1 for _, s, _ in trimmed if stub in s)} (was {sum(1 for _, s, _ in reads if stub in s)})\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The before/after dashboard. On this lane the **quality** is already good (most bases are\n",
    "Q30+), so quality-trimming barely moves the curve - the honest, common real-world result:\n",
    "*QC, and sometimes the verdict is \"this data is already clean.\"* The visible change is the\n",
    "read-length distribution shifting left as the adapter is clipped:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "plot_before_after(reads, trimmed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading fastp's report\n",
    "\n",
    "fastp writes `fastp.json`. You don't eyeball it - you parse the numbers you care about,\n",
    "and *that* parsing is the Python worth keeping. Here is the relevant slice fastp produces\n",
    "for a file like ours (the rates match what we measured above):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# The fields you'd actually pull out of fastp.json.\n",
    "fastp_json = '''\n",
    "{\n",
    "  \"summary\": {\n",
    "    \"before_filtering\": {\"total_reads\": 1000, \"q20_rate\": 0.917, \"q30_rate\": 0.752},\n",
    "    \"after_filtering\":  {\"total_reads\": 994,  \"q20_rate\": 0.999, \"q30_rate\": 0.874}\n",
    "  },\n",
    "  \"adapter_cutting\": {\"adapter_trimmed_reads\": 97},\n",
    "  \"duplication\": {\"rate\": 0.061}\n",
    "}\n",
    "'''\n",
    "rep = json.loads(fastp_json)\n",
    "before, after = rep[\"summary\"][\"before_filtering\"], rep[\"summary\"][\"after_filtering\"]\n",
    "print(f\"Reads:    {before['total_reads']} -> {after['total_reads']} \"\n",
    "      f\"({after['total_reads']/before['total_reads']*100:.1f}% kept)\")\n",
    "print(f\"Q30 rate: {before['q30_rate']*100:.1f}% -> {after['q30_rate']*100:.1f}%\")\n",
    "print(f\"Adapter-trimmed reads: {rep['adapter_cutting']['adapter_trimmed_reads']} \"\n",
    "      f\"({rep['adapter_cutting']['adapter_trimmed_reads']/before['total_reads']*100:.1f}%)\")\n",
    "print(f\"Duplication rate:      {rep['duplication']['rate']*100:.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Section 4: Aggregate with MultiQC\n",
    "\n",
    "One sample you can eyeball. A real run is dozens of lanes. **MultiQC** scans a folder of\n",
    "FastQC / fastp / Trimmomatic outputs and collapses them into one report:\n",
    "\n",
    "```bash\n",
    "multiqc qc/ -o multiqc_report/\n",
    "# open multiqc_report/multiqc_report.html\n",
    "```\n",
    "\n",
    "Its \"General Statistics\" table puts every sample on one row. For a 3-lane run it might read:\n",
    "\n",
    "| Sample          | M Seqs | % Dups | % GC | Mean Q | % Adapter | % Pass trim |\n",
    "|-----------------|:------:|:------:|:----:|:------:|:---------:|:-----------:|\n",
    "| Sample_A_good   |  1.5   |   6%   | 41%  |  36.1  |    3%     |     98%     |\n",
    "| Sample_B_ok     |  1.5   |  11%   | 42%  |  32.4  |   12%     |     91%     |\n",
    "| Sample_C_poor   |  1.5   |  28%   | 45%  |  24.7  |   31%     |     68%     |\n",
    "\n",
    "Reading down the columns, the bad lane jumps out - and flagging it is a one-liner you'd\n",
    "really write:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "multiqc_rows = [\n",
    "    {\"sample\": \"Sample_A_good\", \"mean_q\": 36.1, \"adapter\": 3,  \"pass\": 98},\n",
    "    {\"sample\": \"Sample_B_ok\",   \"mean_q\": 32.4, \"adapter\": 12, \"pass\": 91},\n",
    "    {\"sample\": \"Sample_C_poor\", \"mean_q\": 24.7, \"adapter\": 31, \"pass\": 68},\n",
    "]\n",
    "# Flag any lane with mean Q < 28 or adapter > 5% - MultiQC's usual eyeball test.\n",
    "for r in multiqc_rows:\n",
    "    flag = \"  <-- OUTLIER\" if (r[\"mean_q\"] < 28 or r[\"adapter\"] > 5) else \"\"\n",
    "    print(f\"{r['sample']:16s} meanQ={r['mean_q']:5.1f}  adapter={r['adapter']:2d}%  \"\n",
    "          f\"pass={r['pass']:2d}%{flag}\")\n",
    "worst = min(multiqc_rows, key=lambda r: r[\"mean_q\"])\n",
    "print(f\"\\nWorst lane: {worst['sample']} - quality collapses and a third of reads carry adapter.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Exercises\n",
    "\n",
    "### Exercise 1: Read the QC of our file\n",
    "\n",
    "Using `reads` (loaded in Section 1), answer the two questions FastQC's report raised."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# (a) What % of reads contain the Illumina adapter stub AGATCGGAAGAG?\n",
    "stub = \"AGATCGGAAGAG\"\n",
    "pct_adapter = sum(1 for _, s, _ in reads if stub in s) / len(reads) * 100\n",
    "print(f\"(a) Reads with adapter: {pct_adapter:.1f}%\")\n",
    "\n",
    "# (b) How much does quality decay 5' -> 3'? Compare the first vs last 20 positions.\n",
    "pp = compute_per_position_stats(reads)\n",
    "head_q = sum(s['mean'] for s in pp[:20]) / 20\n",
    "tail_q = sum(s['mean'] for s in pp[-20:]) / 20\n",
    "print(f\"(b) Mean Q first 20 bp: {head_q:.1f}   last 20 bp: {tail_q:.1f}   \"\n",
    "      f\"(drop of {head_q - tail_q:.1f} Q)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2: Parse the Q30 gain from a fastp report\n",
    "\n",
    "Tool reports are JSON - extract what you need. Given fastp's before/after Q30 rates,\n",
    "compute how many *percentage points* of Q30 the trim bought."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "report = json.loads(fastp_json)\n",
    "q30_before = report[\"summary\"][\"before_filtering\"][\"q30_rate\"]\n",
    "q30_after  = report[\"summary\"][\"after_filtering\"][\"q30_rate\"]\n",
    "print(f\"Q30 before: {q30_before*100:.1f}%\")\n",
    "print(f\"Q30 after:  {q30_after*100:.1f}%\")\n",
    "print(f\"Gain: +{(q30_after - q30_before)*100:.1f} percentage points of bases at Q30+\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3: Over-trimming - when cleaning hurts\n",
    "\n",
    "Trim harder and you keep cleaner reads, but you throw more away. Sweep the sliding-window\n",
    "quality cut-off on our real reads and find where trimming turns destructive."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "print(\"  min_qual | % discarded | mean_len\")\n",
    "print(\"  \" + \"-\" * 34)\n",
    "first_over_20 = None\n",
    "for min_q in [20, 25, 28, 30, 32, 35, 40]:\n",
    "    kept, discarded = trim_reads(reads, min_qual=min_q, min_len=36)\n",
    "    pct = discarded / len(reads) * 100\n",
    "    mean_len = sum(len(s) for _, s, _ in kept) / max(1, len(kept))\n",
    "    if pct > 20 and first_over_20 is None:\n",
    "        first_over_20 = min_q\n",
    "    print(f\"  {min_q:8d} | {pct:11.1f} | {mean_len:7.1f}\")\n",
    "print(f\"\\nQ20-30 discards almost nothing (this lane is clean) but keeps shrinking the reads.\")\n",
    "print(f\"Past Q{first_over_20} the sliding window cuts into good sequence and we lose >20% of reads;\")\n",
    "print(f\"by Q35+ almost no 4 bp window can average that high, so reads are cut to nothing.\")\n",
    "print(f\"Q20-25 is the sweet spot.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "| QC step | Real command | What you read |\n",
    "|---------|--------------|---------------|\n",
    "| Profile raw reads | `fastqc reads.fastq -o qc/` | per-base quality, GC, adapter, dup |\n",
    "| Trim (classic) | `trimmomatic SE ... SLIDINGWINDOW:4:20 MINLEN:36` | reads in/out |\n",
    "| Trim (modern) | `fastp -i in.fq -o out.fq --cut_right ...` | `fastp.json` Q20/Q30, adapter, dup |\n",
    "| Re-profile | `fastqc trimmed.fastq -o qc/` | confirm the WARN/FAIL cleared |\n",
    "| Aggregate | `multiqc qc/ -o multiqc_report/` | one row per sample; find the outlier |\n",
    "\n",
    "**What ran where:** the commands above are the real workflow - run them where the tools\n",
    "are installed. The Python here did the two jobs you do by hand: **loaded the reads** and\n",
    "**parsed the reports**; the metric/plot helpers live in `qc_toolkit.py`.\n",
    "\n",
    "**Key takeaways:**\n",
    "- Always QC *before and after* trimming - re-running FastQC is how you prove the trim worked.\n",
    "- This lane's real problem was **adapter**, not quality - trimming's job is to fix the\n",
    "  flagged issue, not to trim for its own sake.\n",
    "- fastp is roughly Trimmomatic + FastQC in one pass and auto-detects the adapter.\n",
    "- Over-trimming (min_qual > 30) discards good reads; Q20-25 is usually enough.\n",
    "\n",
    "**Next:** Module 4 - Read Alignment."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}