Day 1 · Module 2 · 09:45 – 10:30 · 45 min

Sequence Data Formats

FASTA, FASTQ, SAM/BAM, VCF, BED, GFF — the file formats genomics runs on.

What this module covers

  • FASTA: structure, headers, multi-sequence files
  • FASTQ: quality scores, Phred encoding, base calling
  • SAM/BAM: header sections, FLAGS, CIGAR strings
  • VCF: variant representation, INFO/FORMAT fields
  • BED/GFF3: genomic intervals

Start here — the data journey

live in your browser · no install

Watch the data move through the pipeline below, then read on — each section has its own interactive explorer embedded right where the code builds that figure, so you can turn the knobs as you go.

The data journey — one experiment, five file formats

1. FASTA: the genome we measure everything against. Plain sequence, one '>' header per record. No quality, no coordinates, just the letters.

answers: what the reference sequence is

raw bytes

>chr22
GATTACAGGCCTTAACCGGTT
ACGTACGTACGTACGTACGTA
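To make the record structure concrete, here is a minimal sketch of reading those raw bytes in plain Python. The function name `parse_fasta` is hypothetical (it is not taken from the course notebook), and the sketch assumes a small file that fits in memory; a real pipeline would more likely use an established parser.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into {name: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            # Header line: take the first whitespace-delimited token as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: a record may wrap over many lines, so concatenate.
            records[name] += line.upper()
    return records


# The same bytes shown in the example above.
example = """>chr22
GATTACAGGCCTTAACCGGTT
ACGTACGTACGTACGTACGTA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), seq)
```

The two wrapped sequence lines end up as one 42-base string under the name `chr22`, which illustrates why line breaks inside a FASTA record carry no meaning.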

Read left to right, this is the spine of nearly every genomics pipeline: FASTQ → (align to FASTA) → SAM/BAM → (call) → VCF. The rest of this module opens up each format byte by byte.
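As a small taste of the first link in that chain, the sketch below decodes the quality line of a single FASTQ record. The record itself is made up for illustration, the helper names are hypothetical, and the code assumes the Phred+33 offset; some older Illumina files used an offset of 64 instead.

```python
from typing import List


def phred33_to_scores(quality: str) -> List[int]:
    """Convert a Phred+33 quality string to integer Q scores (assumed offset: 33)."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call is wrong, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = ["@read1", "GATTACA", "+", "IIII5##"]
header, bases, separator, quality = record

scores = phred33_to_scores(quality)
for base, q in zip(bases, scores):
    print(base, q, error_probability(q))
```

Each quality character maps to one base, so the quality string and the sequence must always have the same length.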

The notebook — live & editable


Every section's code is already filled in below. Press the ▶ next to any cell (or Shift+Enter) to run it, edit it and run again, or hit Run all to execute the whole notebook top to bottom. No Python or Jupyter install needed — the kernel boots right here in your browser.
