{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Module 1: Linux CLI for Bioinformatics\n",
    "\n",
    "**Duration:** 45 minutes &nbsp;·&nbsp; **Day:** 1 of 2\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "By the end of this module you will be able to:\n",
    "\n",
    "- Navigate a directory tree and inspect files with core Linux commands\n",
    "- Search file contents with `grep` and regular expressions\n",
    "- Chain commands with **pipes** to count, sort, and deduplicate data, the daily bread of genomics\n",
    "- Slice columns and compute with `cut`, `sort`, `uniq`, and `awk`\n",
    "- Wrap shell commands in a loop (and, at the end, in Python) to automate across many samples\n",
    "\n",
    "---\n",
    "\n",
    "> **Why the command line?** Most major bioinformatics tools (BWA, GATK, SAMtools,\n",
    "> Trimmomatic, HISAT2) are built to be run from the command line. Automation,\n",
    "> reproducibility, and scale all live at the terminal.\n",
    "\n",
    "### How this page works — everything here is real bash\n",
    "\n",
    "This module is taught in **bash**, the language of the terminal. You have two ways to run it,\n",
    "and they use the **exact same commands**:\n",
    "\n",
    "1. **The sandbox shell** at the top of the page is a real (in-browser) `bash` terminal.\n",
    "   Type any command there and watch it run.\n",
    "2. **The cells below** begin with `%%bash`. Everything under that line is bash — the *same*\n",
    "   text you'd type in the sandbox. Press ▶ to run a cell and see terminal output right here.\n",
    "\n",
    "There is **no Python to learn first**. The only Python in this module is a short final\n",
    "section showing how you'd *automate* these same commands once you know them — clearly labelled\n",
    "when we get there. All commands use paths relative to the workshop folder (e.g.\n",
    "`data/example/example.fastq`), so a command that works in a cell works verbatim in the sandbox."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Run this once. It just confirms the shell is ready (and, in the full Jupyter\n",
    "# view, wires up the %%bash cell magic). You do not need to understand it.\n",
    "from workshop_shell import setup\n",
    "setup()\n",
    "print(\"Shell ready — every cell below that starts with %%bash is real terminal bash.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How to read a command — and how to get unstuck\n",
    "\n",
    "Every command you type has the same shape:\n",
    "\n",
    "`command   -options   arguments`\n",
    "\n",
    "Take `ls -lh data/example`:\n",
    "\n",
    "- **`ls`** is the *command* — the program to run.\n",
    "- **`-lh`** are *options* (a.k.a. flags) that change how it behaves — here `-l` (long listing) and `-h` (human-readable sizes), bundled together.\n",
    "- **`data/example`** is the *argument* — what to act on.\n",
    "\n",
    "Options are usually short (`-l`) or long (`--help`); arguments are the files or folders you point the command at. That's the whole grammar — every command below is just this pattern.\n",
    "\n",
    "**Stuck on a command? Three ways to teach yourself — the most useful skill on this page:**\n",
    "\n",
    "- `command --help` — a quick summary of the options (e.g. `grep --help`).\n",
    "- `man command` — the full manual (e.g. `man grep`); press `q` to quit.\n",
    "- [explainshell.com](https://explainshell.com) — paste a whole pipeline and it labels every piece for you.\n",
    "\n",
    "(`man` and `--help` belong to a real terminal; the simplified in-browser kernel here may not implement them, so reach for them on your own machine.)"
   ]
  },
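  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch, for a real terminal, of the option grammar above: bundled short options (`-lh`) behave the same as separate ones (`-l -h`). The scratch directory and the file name `tiny.txt` are invented for the illustration.\n",
    "\n",
    "```bash\n",
    "# Make a throwaway directory holding one small file\n",
    "demo=$(mktemp -d)\n",
    "printf 'ACGT\\n' > \"$demo/tiny.txt\"\n",
    "\n",
    "# Same command, options written two ways\n",
    "bundled=$(ls -lh \"$demo\")\n",
    "separate=$(ls -l -h \"$demo\")\n",
    "\n",
    "if [ \"$bundled\" = \"$separate\" ]; then\n",
    "  echo \"-lh and -l -h give the same listing\"\n",
    "fi\n",
    "rm -r \"$demo\"\n",
    "```\n",
    "\n",
    "Bundling only works for single-letter options; long options such as `--help` are always written out in full."
   ]
  },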
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Section 1 · Navigation and file management\n",
    "\n",
    "The workshop data lives in a folder called `data/`. Let's explore it the way you would at\n",
    "a terminal. Start with **where am I** and **what's here**."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# pwd = \"print working directory\": where the shell is right now\n",
    "pwd"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# ls lists a directory. -l = long (one per line, with size), -h = human-readable sizes\n",
    "ls -lh data/example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`find` walks a whole tree, not just one level — handy when data is nested in sub-folders.\n",
    "`du` (\"disk usage\") tells you how big things are, and `file` peeks at the *content* to report\n",
    "what a file actually is (not just its extension)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Every file under data/, at any depth\n",
    "find data -type f"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# How big is each example file? (-s = summary, -h = human-readable)\n",
    "du -sh data/example/*"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# What KIND of file is each one, judged by content?\n",
    "file data/example/*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You create and organise output the same way: `mkdir -p` makes nested folders in one go\n",
    "(the `-p` means \"and any parents, no error if they already exist\"), and `cp` copies a file."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Make a results folder structure, copy a file into it, then look\n",
    "mkdir -p results/qc results/alignment\n",
    "cp data/example/example.fastq results/sample.fastq\n",
    "find results"
   ]
  },
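  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of the two ideas above on a throwaway tree (all directory and file names here are invented): `mkdir -p` can be repeated safely, and `find` accepts a `-name` pattern to narrow the search.\n",
    "\n",
    "```bash\n",
    "demo=$(mktemp -d)\n",
    "\n",
    "# Build nested folders in one go, then add a few empty files\n",
    "mkdir -p \"$demo/run1/raw\" \"$demo/run2/raw\"\n",
    "touch \"$demo/run1/raw/a.fastq\" \"$demo/run2/raw/b.fastq\" \"$demo/run2/notes.txt\"\n",
    "\n",
    "# Re-running mkdir -p on an existing folder is not an error\n",
    "mkdir -p \"$demo/run1/raw\"\n",
    "\n",
    "# Quote the pattern so the shell passes it to find unexpanded\n",
    "n_fastq=$(( $(find \"$demo\" -type f -name '*.fastq' | wc -l) ))\n",
    "echo \"FASTQ files found: $n_fastq\"\n",
    "rm -r \"$demo\"\n",
    "```\n",
    "\n",
    "Wrapping `wc -l` in `$(( ... ))` strips the padding some systems put in front of the number."
   ]
  },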
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Section 2 · Viewing and searching files\n",
    "\n",
    "In bioinformatics you constantly peek inside large files, count records, and search for\n",
    "patterns. `head` shows the first lines (`-n N` sets how many); `tail` shows the last."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Pick the viewer that fits the file's size.** `cat` prints a file *whole* — fine for something small like `genes.tsv` (31 lines). But genomics files are huge: a single FASTQ can run to tens of millions of lines, so you never `cat` one — you *peek* with `head` (first lines) or `tail` (last), or page through it with `less` (arrow keys to scroll, `q` to quit). Rule of thumb: **`cat` the small, `head`/`less` the big.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# cat prints the WHOLE file. genes.tsv is tiny (31 lines), so this is safe.\n",
    "# Never `cat` a multi-gigabyte FASTQ/BAM — peek with head or page with less instead.\n",
    "cat data/example/genes.tsv"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# FASTA: a '>' header line followed by sequence lines\n",
    "head -n 8 data/example/example.fasta"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# FASTQ: four lines per read — header, sequence, '+' separator, quality string\n",
    "head -n 8 data/example/example.fastq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What those four lines actually mean\n",
    "\n",
    "FASTQ stores **four lines per read**: a header, the bases, a `+` separator, and a quality line.\n",
    "That fourth line holds a *per-base confidence score*: one ASCII character for each base in the\n",
    "sequence line. Step through a record below and hover the quality track to decode a Phred score\n",
    "into an error probability."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:fastq-->"
   ]
  },
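  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The decoding can also be sketched in a few lines of bash. This assumes the Phred+33 encoding (the convention for Sanger and current Illumina FASTQ files, but an assumption here, since the encoding is not stated above); `phred_score` is a helper name made up for the example.\n",
    "\n",
    "```bash\n",
    "# Assumption: Phred+33 encoding (quality = ASCII code - 33)\n",
    "phred_score() {\n",
    "  local ascii\n",
    "  # A leading quote makes printf print the character's ASCII code\n",
    "  ascii=$(printf '%d' \"'$1\")\n",
    "  echo $(( ascii - 33 ))\n",
    "}\n",
    "\n",
    "q=$(phred_score 'I')\n",
    "echo \"Quality character I -> Q$q\"\n",
    "\n",
    "# Error probability P = 10^(-Q/10)\n",
    "awk -v q=\"$q\" 'BEGIN { printf \"P(error) = %g\\n\", 10 ^ (-q / 10) }'\n",
    "```\n",
    "\n",
    "`I` is ASCII 73, so Q = 40 and the chance that the base call is wrong is 1 in 10,000."
   ]
  },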
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`wc -l` counts lines. Because each FASTQ read is exactly 4 lines, the number of reads is the\n",
    "line count divided by 4: a perfect first job for command substitution and a little arithmetic."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Count lines, then compute reads = lines / 4.\n",
    "# $( ... ) runs a command and captures its output; $(( ... )) does integer math.\n",
    "wc -l data/example/example.fastq\n",
    "echo \"Reads: $(( $(wc -l < data/example/example.fastq) / 4 ))\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`grep` finds lines matching a pattern. The pattern is a *regular expression*: `^>` means\n",
    "\"a line that **starts with** `>`\". Add `-c` to count matches instead of printing them, and\n",
    "`-v` to **invert** (keep the lines that *don't* match)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Show the FASTA header lines...\n",
    "grep '^>' data/example/example.fasta"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# ...and just count them (each header = one sequence)\n",
    "grep -c '^>' data/example/example.fasta"
   ]
  },
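  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Count FASTQ header lines (they start with '@').\n",
    "# Caveat: '@' is also a legal quality character, so a quality line can start with '@'\n",
    "# and inflate this count. The line count divided by 4 (above) is the reliable number.\n",
    "grep -c '^@' data/example/example.fastq"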
   ]
  },
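  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One caveat with counting reads by `grep -c '^@'`: `@` is also a valid quality character, so a quality line can begin with `@`. The sketch below builds a two-read FASTQ by hand (invented data) to show the miscount next to two counts that do not depend on the `@`.\n",
    "\n",
    "```bash\n",
    "fq=$(mktemp)\n",
    "# Two reads; the second read's quality line happens to start with '@'\n",
    "printf '@read1\\nACGT\\n+\\nIIII\\n@read2\\nGGCA\\n+\\n@III\\n' > \"$fq\"\n",
    "\n",
    "# Lines starting with '@': both headers plus one quality line\n",
    "by_grep=$(grep -c '^@' \"$fq\")\n",
    "# Total lines divided by 4\n",
    "by_lines=$(( $(wc -l < \"$fq\") / 4 ))\n",
    "# Every 4th line, starting at line 1 (the header positions)\n",
    "by_awk=$(( $(awk 'NR % 4 == 1' \"$fq\" | wc -l) ))\n",
    "\n",
    "echo \"grep: $by_grep   lines/4: $by_lines   awk: $by_awk\"\n",
    "rm \"$fq\"\n",
    "```\n",
    "\n",
    "Here `grep` reports 3 while the file holds 2 reads, so prefer a position-based count when the number matters."
   ]
  },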
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now a table. `genes.tsv` is tab-separated, with columns:\n",
    "`gene_id  gene_name  chromosome  start  end  strand`. Peek at it, then use `grep -E`\n",
    "(extended regex) to pull rows for one chromosome."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "head -n 5 data/example/genes.tsv"
   ]
  },
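  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Genes on chromosome 17: the column 3 value must be exactly 'chr17'.\n",
    "# The gaps in the pattern are literal Tab characters; each bracketed group skips one column.\n",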
    "grep -E '^[^\t]*\t[^\t]*\tchr17\t' data/example/genes.tsv"
   ]
  },
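  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative sketch for matching one column exactly is `awk` with a tab field separator, which avoids typing literal tabs. The four-row table is toy data invented here (with three columns, unlike `genes.tsv`); it also shows why a bare `grep 'chr1'` is not enough.\n",
    "\n",
    "```bash\n",
    "tsv=$(mktemp)\n",
    "printf 'gene_id\\tgene_name\\tchromosome\\n' > \"$tsv\"\n",
    "printf 'G1\\tTP53\\tchr17\\nG2\\tEGFR\\tchr7\\nG3\\tNRAS\\tchr1\\nG4\\tBRCA1\\tchr17\\n' >> \"$tsv\"\n",
    "\n",
    "# Substring match: 'chr1' also hits every chr17 row\n",
    "loose=$(grep -c 'chr1' \"$tsv\")\n",
    "\n",
    "# Exact match on column 3 only (-F'\\t' splits fields on tabs)\n",
    "exact=$(( $(awk -F'\\t' '$3 == \"chr1\"' \"$tsv\" | wc -l) ))\n",
    "\n",
    "echo \"grep 'chr1': $loose rows   awk exact: $exact row\"\n",
    "rm \"$tsv\"\n",
    "```\n",
    "\n",
    "Comparing a named column with `==` says what you mean and keeps working if the column order is the only thing you need to adjust."
   ]
  },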
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Section 3 · Pipes and text processing\n",
    "\n",
    "The pipe `|` sends one command's output straight into the next. This is **the** big idea of\n",
    "the terminal: small tools, each doing one thing, snapped together into a pipeline.\n",
    "\n",
    "A classic question — *\"which chromosome has the most genes?\"* — is four small steps:\n",
    "`cut` one column → `sort` so identical values are adjacent → `uniq -c` to collapse and count\n",
    "→ `sort -rn` to rank. (We use `tail -n +2` first to drop the header row.)"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# cut -fN keeps column N (tab-separated by default)\n",
    "tail -n +2 data/example/genes.tsv | cut -f3 | head"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# The full pipeline: chromosome → grouped → counted → ranked\n",
    "tail -n +2 data/example/genes.tsv | cut -f3 | sort | uniq -c | sort -rn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **The #1 pipeline bug:** `uniq` only collapses *adjacent* duplicates, so it only counts\n",
    "> correctly if you `sort` first. Forget the `sort` and your counts come out wrong. Toggle the\n",
    "> stages in the explorer below to feel exactly why."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--widget:pipe-->"
   ]
  },
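  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same bug can be reproduced in a real terminal with five made-up values. A minimal sketch:\n",
    "\n",
    "```bash\n",
    "# Five chromosome labels, deliberately not grouped\n",
    "chroms=$(printf 'chr1\\nchr2\\nchr1\\nchr2\\nchr1\\n')\n",
    "\n",
    "echo \"without sort:\"\n",
    "printf '%s\\n' \"$chroms\" | uniq -c\n",
    "\n",
    "echo \"with sort:\"\n",
    "printf '%s\\n' \"$chroms\" | sort | uniq -c\n",
    "\n",
    "# Number of groups each version reports\n",
    "unsorted_groups=$(( $(printf '%s\\n' \"$chroms\" | uniq -c | wc -l) ))\n",
    "sorted_groups=$(( $(printf '%s\\n' \"$chroms\" | sort | uniq -c | wc -l) ))\n",
    "echo \"groups: $unsorted_groups unsorted, $sorted_groups sorted\"\n",
    "```\n",
    "\n",
    "Without `sort`, no two equal values are adjacent, so `uniq -c` reports five groups of one; with `sort` it reports `chr1` three times and `chr2` twice."
   ]
  },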
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you need arithmetic on columns — not just counting — reach for **`awk`**. It runs a little\n",
    "program once per line, with `$1 $2 $3 ...` as the columns and `NR` as the current row number.\n",
    "Here: skip the header (`NR>1`), then print each gene's name and its length (`end − start`)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Gene name (col 2) and length (col 5 - col 4), longest first.\n",
    "# grep -v NULL first drops the one row with missing coordinates — clean your data!\n",
    "grep -v NULL data/example/genes.tsv | awk 'NR>1 {print $2, $5-$4}' | sort -k2,2rn | head -n 5"
   ]
  },
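  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`grep -v NULL` drops any row containing the text `NULL` anywhere, including in a gene name. A slightly stricter sketch is to let `awk` check that both coordinate columns are numeric. The three-row table and its column layout (`gene_name`, `start`, `end`) are invented for this example and differ from `genes.tsv`.\n",
    "\n",
    "```bash\n",
    "# Keep data rows whose start and end are whole numbers, then print name and length\n",
    "lengths=$(printf 'gene_name\\tstart\\tend\\nA\\t100\\t600\\nB\\t50\\t150\\nC\\tNULL\\tNULL\\n' |\n",
    "  awk -F'\\t' 'NR > 1 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ { print $1, $3 - $2 }')\n",
    "echo \"$lengths\"\n",
    "```\n",
    "\n",
    "Row `C` is skipped because its coordinates are not numbers, leaving `A 500` and `B 100`."
   ]
  },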
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Section 4 · Loops\n",
    "\n",
    "Real projects have *many* samples. A `for` loop runs the same commands for each item, so you\n",
    "never copy-paste a pipeline twelve times. Read it as: *for each VALUE in a list, do these\n",
    "commands*. Inside the loop, `$VAR` is the current value."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Pretend these are sample names; print the command we'd run for each\n",
    "for SAMPLE in SRR001 SRR002 SRR003; do\n",
    "  echo \"would align $SAMPLE  ->  results/alignment/${SAMPLE}.bam\"\n",
    "done"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# A loop that does real work: count genes on each chromosome of interest.\n",
    "# ^ and $ anchor the match to the whole line, so chr1 does not also count chr17.\n",
    "for CHR in chr17 chr7 chr1; do\n",
    "  COUNT=$(cut -f3 data/example/genes.tsv | grep -c \"^$CHR$\")\n",
    "  echo \"$CHR: $COUNT genes\"\n",
    "done"
   ]
  },
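  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice the list usually comes from the files themselves rather than being typed out. A sketch, assuming a folder of `.fastq` files (the folder and sample names are invented):\n",
    "\n",
    "```bash\n",
    "demo=$(mktemp -d)\n",
    "touch \"$demo/SRR001.fastq\" \"$demo/SRR002.fastq\"\n",
    "\n",
    "samples=\"\"\n",
    "# The glob expands to one path per matching file\n",
    "for FASTQ in \"$demo\"/*.fastq; do\n",
    "  # basename strips the directory and the given suffix\n",
    "  SAMPLE=$(basename \"$FASTQ\" .fastq)\n",
    "  echo \"would align $SAMPLE  ->  results/alignment/${SAMPLE}.bam\"\n",
    "  samples=\"$samples $SAMPLE\"\n",
    "done\n",
    "rm -r \"$demo\"\n",
    "```\n",
    "\n",
    "Quoting `\"$FASTQ\"` keeps the loop correct even if a file name contains spaces."
   ]
  },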
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A reusable analysis is just a loop like this saved in a `.sh` script. Production pipelines start\n",
    "with **`set -euo pipefail`** so the script stops on the first error instead of charging ahead.\n",
    "You don't run heavy tools (BWA, SAMtools) in the browser, but the *shape* is exactly what you'll\n",
    "write on a real machine:\n",
    "\n",
    "```bash\n",
    "#!/usr/bin/env bash\n",
    "set -euo pipefail                  # stop on errors, unset vars, and pipe failures\n",
    "\n",
    "SAMPLES=(SRR001 SRR002 SRR003)\n",
    "REF=\"reference/chr22.fa\"\n",
    "\n",
    "mkdir -p results/alignment\n",
    "for SAMPLE in \"${SAMPLES[@]}\"; do\n",
    "    echo \"[$(date)] aligning ${SAMPLE}\"\n",
    "    bwa mem -t 4 \"${REF}\" \"data/${SAMPLE}.fastq.gz\" \\\n",
    "        | samtools sort -o \"results/alignment/${SAMPLE}.bam\"\n",
    "    samtools index \"results/alignment/${SAMPLE}.bam\"\n",
    "done\n",
    "```"
   ]
  },
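  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What `pipefail` changes can be seen with `false` (always fails) and `true` (always succeeds). A minimal sketch, run in child shells so the setting does not leak into your session:\n",
    "\n",
    "```bash\n",
    "# By default a pipeline's exit status is that of its LAST command\n",
    "without=$(bash -c 'false | true; echo $?')\n",
    "\n",
    "# With pipefail, a failure anywhere in the pipeline is reported\n",
    "with=$(bash -c 'set -o pipefail; false | true; echo $?')\n",
    "\n",
    "echo \"without pipefail: $without   with pipefail: $with\"\n",
    "```\n",
    "\n",
    "In a script like the one above, this is what lets `set -e` notice a failed first stage of a pipeline instead of moving on to the next command."
   ]
  },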
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Section 5 · Working with compressed files\n",
    "\n",
    "Real sequencing data is almost always **gzip-compressed** (`.fastq.gz`). The trick is you don't\n",
    "need to unzip it first: `zcat` streams the decompressed text, and `zgrep` greps inside it\n",
    "directly. Let's make a compressed copy to practise on."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# (One small setup step in Python — the shell can READ a .gz but not create one here.)\n",
    "import gzip, shutil, workshop_shell\n",
    "root = workshop_shell._find_root()\n",
    "with open(f\"{root}/data/example/example.fastq\", \"rb\") as fi, \\\n",
    "     gzip.open(f\"{root}/data/example/example.fastq.gz\", \"wb\") as fo:\n",
    "    shutil.copyfileobj(fi, fo)\n",
    "print(\"Created data/example/example.fastq.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# The .gz is much smaller; compare it to the original\n",
    "ls -lh data/example/example.fastq data/example/example.fastq.gz"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# zcat = \"cat a .gz\": stream its contents without leaving an unzipped file behind\n",
    "zcat data/example/example.fastq.gz | head -n 4"
   ]
  },
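  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "# zgrep greps straight into the compressed file. Here it counts the '@' header lines\n",
    "# (a quality line that happens to start with '@' would be counted too, so treat the\n",
    "# result as an upper bound on the read count).\n",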
    "zgrep -c '^@' data/example/example.fastq.gz"
   ]
  },
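  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On a real machine you would normally create the compressed copy in the shell as well. A sketch using `gzip -c` and `gzip -dc` (the latter does the same job as `zcat`) on an invented one-read file:\n",
    "\n",
    "```bash\n",
    "demo=$(mktemp -d)\n",
    "printf '@read1\\nACGT\\n+\\nIIII\\n' > \"$demo/tiny.fastq\"\n",
    "\n",
    "# -c writes the compressed data to stdout, so the original file is kept\n",
    "gzip -c \"$demo/tiny.fastq\" > \"$demo/tiny.fastq.gz\"\n",
    "\n",
    "# -dc decompresses to stdout; nothing unzipped is left on disk\n",
    "lines=$(( $(gzip -dc \"$demo/tiny.fastq.gz\" | wc -l) ))\n",
    "echo \"reads: $(( lines / 4 ))\"\n",
    "rm -r \"$demo\"\n",
    "```\n",
    "\n",
    "Streaming through a pipe like this is how multi-gigabyte `.fastq.gz` files are usually counted without ever being unpacked."
   ]
  },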
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Section 6 · Automating with Python *(the bridge to the rest of the workshop)*\n",
    "\n",
    "Everything above was **bash**, and bash is perfect for filtering, counting, and one-liners.\n",
    "But when you need to loop with logic, do real arithmetic, or feed results into plots and stats,\n",
    "you wrap those same shell commands in **Python** — which is what Modules 2–8 do.\n",
    "\n",
    "The bridge is one helper: `run(\"...\")` runs a shell command and hands its output back to Python\n",
    "as a string. Same commands you just learned — now you can compute on the result."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from workshop_shell import run\n",
    "\n",
    "# Capture a shell command's output, then use it as a Python number.\n",
    "# Reads = lines / 4; counting '@' lines can overcount, since '@' is also a quality character.\n",
    "n_reads = int(run(\"wc -l data/example/example.fastq\").split()[0]) // 4\n",
    "n_seqs  = int(run(\"grep -c '^>' data/example/example.fasta\").strip())\n",
    "print(f\"{n_reads} reads in the FASTQ, {n_seqs} sequences in the FASTA\")"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Arithmetic across rows is where Python (or awk's END block) shines.\n",
    "# Here we parse a shell pipeline's output and compute the mean gene length.\n",
    "rows = run(\"tail -n +2 data/example/genes.tsv | cut -f4,5\").strip().split(\"\\n\")\n",
    "lengths = [int(end) - int(start)\n",
    "           for start, end in (r.split(\"\\t\") for r in rows)\n",
    "           if start.isdigit() and end.isdigit()]\n",
    "print(f\"{len(lengths)} genes, mean length {sum(lengths) / len(lengths):,.0f} bp\")"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# pathlib joins directories and filenames safely, with no hand-pasted '/' separators.\n",
    "# This is how real pipelines name their outputs, one per input sample.\n",
    "from pathlib import Path\n",
    "\n",
    "def output_path(fastq_name, suffix, outdir):\n",
    "    # Strip the extension from the END of the name only (str.removesuffix needs Python 3.9+)\n",
    "    stem = Path(fastq_name).name.removesuffix(\".gz\").removesuffix(\".fastq\")\n",
    "    return Path(outdir) / f\"{stem}{suffix}\"\n",
    "\n",
    "for fastq in [\"SRR001.fastq.gz\", \"SRR002.fastq\", \"tumor.fastq.gz\"]:\n",
    "    bam = output_path(fastq, \".bam\", \"results/alignment\")\n",
    "    vcf = output_path(fastq, \".vcf.gz\", \"results/variants\")\n",
    "    print(f\"{fastq:18s} -> {bam}  |  {vcf}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Exercises\n",
    "\n",
    "Try each in a `%%bash` cell (or the sandbox) *before* peeking. Every solution is one short\n",
    "pipeline. Edit and re-run freely.\n",
    "\n",
    "1. Count the sequences in `example.fasta`.\n",
    "2. Find the longest FASTA header and its length.\n",
    "3. Compute the mean gene length from `genes.tsv`.\n",
    "4. Count genes on the `+` strand vs the `-` strand."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 1 — count sequences** (count the `>` headers):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "grep -c '^>' data/example/example.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 2 — longest header.** Print each header's length next to it, sort numerically, take the top:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "grep '^>' data/example/example.fasta | awk '{print length, $0}' | sort -rn | head -n 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 3 — mean gene length.** `awk` accumulates `end-start` over all rows, then divides in an `END` block. We `grep -v NULL` first so the one row with missing coordinates doesn't pull the mean down. `printf \"%.0f\"` rounds to the nearest whole number, as the Python version does, so the two answers agree (Python just adds a thousands separator):"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "grep -v NULL data/example/genes.tsv | awk 'NR>1 {total += $5 - $4; n++} END {printf \"mean length: %.0f bp\\n\", total / n}'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Solution 4 — genes per strand.** Strand is column 6; the now-familiar count-by pattern:"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "%%bash\n",
    "tail -n +2 data/example/genes.tsv | cut -f6 | sort | uniq -c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary\n",
    "\n",
    "| Command | What it does | In bioinformatics |\n",
    "|---|---|---|\n",
    "| `ls -lh` | list files with sizes | check output files exist |\n",
    "| `head` / `tail` | first / last N lines | peek at FASTQ/VCF/SAM |\n",
    "| `wc -l` | count lines | count reads (÷4 for FASTQ) |\n",
    "| `grep -c '^>'` | count matching lines | count FASTA records |\n",
    "| `grep -v '^#'` | drop matching lines | skip VCF/GFF headers |\n",
    "| `cut -f3` | extract columns | pull chrom / coords |\n",
    "| `sort \\| uniq -c` | count occurrences | distribution by category |\n",
    "| `awk '{...}'` | compute on columns | gene length, filtering |\n",
    "| `zcat` / `zgrep` | read `.gz` without unzipping | work with `.fastq.gz` |\n",
    "| `for ...; do ...; done` | repeat over a list | process many samples |\n",
    "| `run(\"...\")` (Python) | capture shell output | automate + analyse |\n",
    "\n",
    "**Key takeaways**\n",
    "\n",
    "- Pipes (`|`) are the most powerful idea at the terminal — small tools, snapped together.\n",
    "- Always `sort` before `uniq -c`, or your counts will be wrong.\n",
    "- Use `set -euo pipefail` at the top of every real shell script.\n",
    "- Bash for filtering and counting; Python (Modules 2–8) when you need logic, math, and plots.\n",
    "\n",
    "**Next:** Module 2 — Sequence Data Formats (FASTA, FASTQ, SAM, BAM, VCF)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}