Day 1 · Module 1 · 09:00 – 09:45 · 45 min

Linux CLI for Bioinformatics

Pipes, grep, awk, and the shell scripting that powers every genomics pipeline.

What this module covers

▸ File system navigation, pipes, grep, awk, cut, sort, uniq
▸ Working with compressed files (gzip, bgzip, tabix)
▸ Driving shell from Python via subprocess and pathlib
▸ Writing reproducible shell scripts (set -euo pipefail)
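As a taste of the last two bullets, here is a minimal sketch of a script that starts in strict mode and streams a gzip-compressed file through a pipe. The scratch directory and the two-row genes.tsv are made up for illustration; they are not the workshop's sample data.

```shell
#!/usr/bin/env bash
# Strict mode: stop on any error (-e), on unset variables (-u),
# and when any stage of a pipeline fails (pipefail).
set -euo pipefail

# Hypothetical scratch directory, removed again when the script exits.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT

# Hypothetical two-row table, written with printf so the tabs are real tabs.
printf 'chr1\t50\t420\tGENEB\nchr22\t101\t900\tGENEA\n' > "$workdir/genes.tsv"

# gzip replaces genes.tsv with genes.tsv.gz.
gzip "$workdir/genes.tsv"

# gzip -dc decompresses to stdout, so the rows can be counted
# without writing an uncompressed copy back to disk.
n_rows="$(gzip -dc "$workdir/genes.tsv.gz" | wc -l | tr -d ' ')"
echo "rows: $n_rows"
```

With pipefail set, a failure in gzip -dc (for example a corrupt archive) stops the script instead of silently producing a wrong count from wc.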


Workshop recording

The recorded walkthrough for this module will appear here.

Coming after the live workshop.

Try it live — sandbox shell

runs in your browser · no install

New to the command line? You don't need to install anything. This is a simulated bash shell with the workshop's sample data already loaded. Work through the tasks on the right, or just type help to explore.


Start here — the data journey

live in your browser · no install

Watch the data move through the pipeline below, then read on — each section has its own interactive explorer embedded right where the code builds that figure, so you can turn the knobs as you go.

The data journey — one pipeline, six small commands


1. genes.tsv: a tab-separated file with one gene per row and columns for chromosome, start, end and name, plus a header line on top.

2. grep -E '^chr': keep only lines that start with 'chr', i.e. real data rows. The '#' header line is dropped.

3. cut -f1: slice out just the first field of every row, the chromosome. Everything else falls away.

4. sort: order the chromosomes alphabetically so identical values sit next to each other, which is required before we can count runs.

5. uniq -c: collapse each run of identical lines into a single line, prefixed by how many times it occurred.

6. sort -rn: sort by that leading count, numerically and in reverse, so the busiest chromosome lands on top. That's your answer.
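Step 4 is easy to skip by accident, so here is a small sketch of what happens to the counts with and without it. The five labels are the ones shown in the preview after cut -f1; the awk at the end only trims the padding that uniq -c puts in front of each count, and LC_ALL=C is an assumption added to keep the sort order the same on every machine.

```shell
# Five chromosome labels in file order.
labels='chr22
chr1
chr22
chrX
chr1'

# Without sort: uniq only merges adjacent duplicates,
# so every line here counts as its own run of length 1.
unsorted_counts="$(printf '%s\n' "$labels" | uniq -c | awk '{print $1, $2}')"

# With sort: identical labels become neighbours and are counted together.
sorted_counts="$(printf '%s\n' "$labels" | LC_ALL=C sort | uniq -c | awk '{print $1, $2}')"

echo "without sort:"
echo "$unsorted_counts"
echo "with sort:"
echo "$sorted_counts"
```

Without sort, chr22 and chr1 each appear twice in the output with a count of 1; with sort, each chromosome appears once with its true count.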

output preview

1. genes.tsv

#chrom  start  end   name
chr22   101    900   GENEA
chr1    50     420   GENEB

2. after grep -E '^chr'

chr22   101    900   GENEA
chr1    50     420   GENEB
chr22   980   1500   GENEC

3. after cut -f1

chr22
chr1
chr22
chrX
chr1

4. after sort

chr1
chr1
chr22
chr22
chrX

5. after uniq -c

  61 chr1
  58 chr2
  44 chr22
  12 chrX

6. after sort -rn

  61 chr1
  58 chr2
  44 chr22
  12 chrX

Each command does one job and pipes its output into the next, so 240 rows narrow to a ranked count of 24 chromosomes. That's the whole philosophy of the shell: grep -E '^chr' genes.tsv | cut -f1 | sort | uniq -c | sort -rn
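To try the whole chain outside the sandbox, the sketch below builds a miniature stand-in for genes.tsv in a temporary file and runs the five commands over it. The first three data rows come from the preview above; GENED, GENEE and GENEF and their coordinates are invented so that every chromosome ends up with a different count. The trailing awk is an addition that only strips the padding from uniq -c.

```shell
set -euo pipefail

# Hypothetical miniature genes.tsv (tab-separated, '#' header on top).
genes="$(mktemp)"
{
  printf '#chrom\tstart\tend\tname\n'
  printf 'chr22\t101\t900\tGENEA\n'
  printf 'chr1\t50\t420\tGENEB\n'
  printf 'chr22\t980\t1500\tGENEC\n'
  printf 'chrX\t10\t200\tGENED\n'      # invented row
  printf 'chr1\t600\t800\tGENEE\n'     # invented row
  printf 'chr22\t2000\t2500\tGENEF\n'  # invented row
} > "$genes"

# Drop the header, keep the chromosome column, group, count, rank.
ranked="$(grep -E '^chr' "$genes" \
  | cut -f1 \
  | LC_ALL=C sort \
  | uniq -c \
  | sort -rn \
  | awk '{print $1, $2}')"

echo "$ranked"
rm -f "$genes"
```

On this sample the busiest chromosome is chr22 with three genes, followed by chr1 with two and chrX with one.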

The notebook — live & editable


Every section's code is already filled in below. Press the ▶ next to any cell (or Shift+Enter) to run it, edit it and run again, or hit Run all to execute the whole notebook top to bottom. No Python or Jupyter install needed — the kernel boots right here in your browser.


Heads up: this module's pipeline uses command-line tools (e.g. bwa, samtools) that aren't available in the browser kernel. The Python cells run here; tool/shell lines print a note instead.
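If you run the notebook on your own machine instead, a quick check like the sketch below tells you which of those tools are on your PATH before you start. The helper name have_tool is made up for this example; command -v is a standard shell builtin that exits non-zero when a name cannot be found.

```shell
# Hypothetical helper: succeeds if the named program is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Check the two tools named in the note above.
missing=""
for tool in bwa samtools; do
  if have_tool "$tool"; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
    missing="$missing $tool"
  fi
done

# Summarise without failing, so the check is safe to paste anywhere.
if [ -n "$missing" ]; then
  echo "install before running the tool cells:$missing"
fi
```

The loop only reports; it deliberately does not exit on a missing tool, so the Python-only cells can still be run.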
