Day 1 · Module 2 · 09:45 – 10:30 · 45 min

Sequence Data Formats

FASTA, FASTQ, SAM/BAM, VCF, BED, GFF — the file formats genomics runs on.

What this module covers

▸ FASTA: structure, headers, multi-sequence files
▸ FASTQ: quality scores, Phred encoding, base calling
▸ SAM/BAM: header sections, FLAGS, CIGAR strings
▸ VCF: variant representation, INFO/FORMAT fields
▸ BED/GFF3: genomic intervals

Start here — the data journey

Watch the data move through the pipeline below, then read on — each section has its own interactive explorer embedded right where the code builds that figure, so you can turn the knobs as you go.

The data journey — one experiment, five file formats

1. FASTA The genome we measure everything against — plain sequence, one '>' header per record. No quality, no coordinates, just the letters.

answers: what the reference sequence is

2. FASTQ What the sequencer emits: millions of short reads, each four lines — header, bases, '+', and a per-base quality string.

answers: the called bases + how confident each call is

3. SAM / BAM Each read placed on the reference: position, a FLAG of bit-packed properties, mapping quality, and a CIGAR string describing how the read aligns (matches, insertions, deletions, clipping). BAM is the compressed binary encoding of the same records.

answers: where each read maps and how cleanly

4. VCF The distilled result: one row per position where the sample differs from the reference — REF/ALT alleles, a QUAL score, a FILTER verdict, and INFO/FORMAT fields.

answers: how the sample differs from the reference

5. BED / GFF Coordinates given meaning: genes, exons, peaks, regions of interest. BED is minimal (chrom-start-end); GFF3 is richer (type, parent, attributes).

answers: what the regions on the genome are

raw bytes

>chr22
GATTACAGGCCTTAACCGGTT
ACGTACGTACGTACGTACGTA
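
As a minimal sketch of how the FASTA record above could be read, the generator below starts a new record at each '>' header and joins the wrapped sequence lines that follow it. The name read_fasta is a hypothetical helper, not a library function, and the sketch assumes well-formed input that fits in memory.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs. Hypothetical helper for illustration."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)  # wrapped sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


fasta_text = """>chr22
GATTACAGGCCTTAACCGGTT
ACGTACGTACGTACGTACGTA
"""

records = list(read_fasta(fasta_text.splitlines()))
for name, seq in records:
    print(name, len(seq), seq)
```

For the two wrapped lines shown, the joined sequence is 42 bases long. A multi-sequence file works the same way: each additional '>' line closes the previous record and opens the next.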

@SEQ_001 read1
GATTACAGATTACATTGG
+
IIIIIHHHGGFEDCBA@?
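
The quality string in the read above can be decoded with a short sketch like the one below. It assumes the Phred+33 offset (the common Sanger / Illumina 1.8+ convention); older files may use a different offset, so treat the constant as an assumption. The function names are illustrative only.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> list:
    """Convert each quality character to its integer Phred score."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# The four lines of the FASTQ record shown above.
header, bases, separator, quality = [
    "@SEQ_001 read1",
    "GATTACAGATTACATTGG",
    "+",
    "IIIIIHHHGGFEDCBA@?",
]

# A valid record has exactly one quality character per base.
assert len(bases) == len(quality)

scores = phred_scores(quality)
for base, q in zip(bases, scores):
    print(f"{base}  Q{q}  p_error={error_probability(q):.5f}")
```

Under this assumption the first base ('I') decodes to Q40 and the last ('?') to Q30, so confidence falls toward the end of the read, from a 1 in 10,000 to a 1 in 1,000 chance of a wrong call.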

SEQ_001  0  chr22  101  60  18M  *  0  0  GATT…
SEQ_002 16  chr22  140  60  5M1D12M …
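
To make the FLAG and CIGAR columns in the two alignment lines above concrete, here is a small sketch that unpacks both. The bit values and operation letters follow the SAM specification; the helper names (decode_flag, parse_cigar, reference_span, query_length) and the short labels are made up for this example.

```python
import re

# FLAG bits as defined in the SAM specification; labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> list:
    """Return the labels of every bit set in the FLAG integer."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Bases covered on the reference: M, D, N, = and X consume it."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Bases taken from the read: M, I, S, = and X consume it."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


for name, flag, pos, cigar in [("SEQ_001", 0, 101, "18M"),
                               ("SEQ_002", 16, 140, "5M1D12M")]:
    last = pos + reference_span(cigar) - 1  # POS is 1-based
    print(name, decode_flag(flag), parse_cigar(cigar),
          f"covers {pos}-{last}", f"read bases: {query_length(cigar)}")
```

FLAG 0 has no bits set (a mapped, forward-strand, unpaired read), while FLAG 16 sets only the reverse-strand bit. The CIGAR 5M1D12M covers 18 reference bases but uses only 17 read bases, because the deleted base exists in the reference and not in the read.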

#CHROM POS    ID  REF ALT QUAL FILTER INFO
chr22  17280632 .  G   A   982  PASS   DP=54;AF=0.5
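
The variant row above can be split into typed fields with a sketch along these lines. Real VCF files are tab-delimited; the example splits on any whitespace only because the row is displayed here with spaces. The function names are hypothetical, and the sketch covers just the eight fixed columns, leaving FORMAT and per-sample columns aside.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> dict:
    """Turn 'DP=54;AF=0.5' into a dict; bare keys become True flags."""
    fields = {}
    if info == ".":
        return fields  # '.' means no INFO entries
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed VCF columns of one data row."""
    values = line.split()  # assumption: no spaces inside any field
    row = dict(zip(VCF_COLUMNS, values))
    row["POS"] = int(row["POS"])  # 1-based position on CHROM
    row["QUAL"] = None if row["QUAL"] == "." else float(row["QUAL"])
    row["ALT"] = row["ALT"].split(",")  # several ALT alleles are comma-separated
    row["INFO"] = parse_info(row["INFO"])
    return row


variant = parse_vcf_line("chr22  17280632 .  G   A   982  PASS   DP=54;AF=0.5")
print(variant)
```

INFO values are left as strings in this sketch, because their types are declared in the file's header lines, which are not shown in the row above.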

chr22  17280000  17285000  GENEA  0  +
chr22  17280631  17280632  SNP_hit
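
BED coordinates are 0-based and half-open, whereas VCF (and GFF3) positions are 1-based, so comparing the two takes a small conversion. The sketch below illustrates this with the GENEA interval and the variant position from the VCF row; BedInterval and the helper functions are hypothetical names for this example.

```python
from typing import NamedTuple, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str = "."


def parse_bed_line(line: str) -> BedInterval:
    """Read the first three BED columns, plus the name if present."""
    fields = line.split()
    name = fields[3] if len(fields) > 3 else "."
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), name)


def contains_one_based(interval: BedInterval, chrom: str, pos: int) -> bool:
    """True if a 1-based position (as in VCF) lies inside the BED interval."""
    return chrom == interval.chrom and interval.start < pos <= interval.end


def one_based_to_bed(pos: int) -> Tuple[int, int]:
    """A single 1-based position as a BED (start, end) pair."""
    return pos - 1, pos


gene = parse_bed_line("chr22  17280000  17285000  GENEA  0  +")
print(gene.name, "length:", gene.end - gene.start)
print("variant inside gene:", contains_one_based(gene, "chr22", 17280632))
print("variant as BED interval:", one_based_to_bed(17280632))
```

Because the end coordinate is exclusive, an interval's length is simply end minus start, which is one reason the half-open convention is convenient for interval arithmetic.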

Read left to right, this is the spine of nearly every genomics pipeline: FASTQ → (align to FASTA) → SAM/BAM → (call) → VCF. The rest of this module opens up each format byte by byte.

The notebook — live & editable

Every section's code is already filled in below. Press the ▶ next to any cell (or Shift+Enter) to run it, edit it and run again, or hit Run all to execute the whole notebook top to bottom. No Python or Jupyter install needed — the kernel boots right here in your browser.
