Sequence Data Formats
FASTA, FASTQ, SAM/BAM, VCF, BED, GFF — the file formats genomics runs on.
What this module covers
- FASTA: structure, headers, multi-sequence files
- FASTQ: quality scores, Phred encoding, base calling
- SAM/BAM: header sections, FLAGS, CIGAR strings
- VCF: variant representation, INFO/FORMAT fields
- BED/GFF3: genomic intervals
Start here — the data journey
Watch the data move through the pipeline below, then read on. Each section has its own interactive explorer embedded right where the code builds that figure, so you can turn the knobs as you go.
The data journey — one experiment, five file formats
1. FASTA The genome we measure everything against — plain sequence, one '>' header per record. No quality, no coordinates, just the letters.
answers: what the reference sequence is
2. FASTQ What the sequencer emits: millions of short reads, each four lines — header, bases, '+', and a per-base quality string.
answers: the called bases + how confident each call is
3. SAM / BAM Each read placed on the reference: position, a FLAG of bit-packed properties, mapping quality, and a CIGAR string for the match. BAM is just the compressed binary form.
answers: where each read maps and how cleanly
4. VCF The distilled result: one row per position where the sample differs from the reference — REF/ALT alleles, a QUAL score, a FILTER verdict, and INFO/FORMAT fields.
answers: how the sample differs from the reference
5. BED / GFF Coordinates given meaning: genes, exons, peaks, regions of interest. BED is minimal (chrom-start-end); GFF3 is richer (type, parent, attributes).
answers: what the regions on the genome are
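The "how confident each call is" part of the FASTQ stop can be made concrete. The sketch below assumes the common Phred+33 encoding (the overview does not say which offset its files use; older Illumina files used an offset of 64), and the helper names `phred_scores` and `error_probability` are illustrative, not part of any library.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a FASTQ quality string to integer Phred scores.

    offset=33 is an assumption (Phred+33); pass 64 for old Illumina files.
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Quality string copied from the sample FASTQ read in this module.
quality = "IIIIIHHHGGFEDCBA@?"
scores = phred_scores(quality)

for ch, q in zip(quality, scores):
    print(f"{ch!r}  Q{q}  P(error) = {error_probability(q):.5f}")
```

Under that assumption the leading 'I' characters decode to Q40 (about a 1 in 10,000 chance the base is wrong) and the trailing '?' to Q30 (about 1 in 1,000), which matches the usual pattern of quality dropping toward the end of a read.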
raw bytes
FASTA
  >chr22
  GATTACAGGCCTTAACCGGTT
  ACGTACGTACGTACGTACGTA
FASTQ
  @SEQ_001 read1
  GATTACAGATTACATTGG
  +
  IIIIIHHHGGFEDCBA@?
SAM
  SEQ_001 0 chr22 101 60 18M * 0 0 GATT…
  SEQ_002 16 chr22 140 60 5M1D12M …
VCF
  #CHROM POS ID REF ALT QUAL FILTER INFO
  chr22 17280632 . G A 982 PASS DP=54;AF=0.5
BED
  chr22 17280000 17285000 GENEA 0 +
  chr22 17280632 17280633 SNP_hit
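The two SAM sample records carry a FLAG (0 and 16) and a CIGAR string (18M and 5M1D12M). As a rough sketch of how those fields unpack, the code below decodes the FLAG bits and sums the CIGAR operations. The bit meanings and the sets of operations that consume reference or read bases follow the SAM specification; the function names and the short labels in `SAM_FLAGS` are illustrative choices.

```python
import re

# FLAG bits as defined in the SAM specification; the labels are shorthand.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> list[str]:
    """Return the label of every bit that is set in a SAM FLAG."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Bases covered on the reference: M, D, N, = and X consume reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Bases taken from the read: M, I, S, = and X consume the query."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


# FLAG, POS and CIGAR copied from the two sample SAM records.
for name, flag, pos, cigar in [("SEQ_001", 0, 101, "18M"),
                               ("SEQ_002", 16, 140, "5M1D12M")]:
    end = pos + reference_span(cigar) - 1  # SAM POS is 1-based, inclusive
    print(name, decode_flag(flag), parse_cigar(cigar),
          f"ref {pos}-{end}", f"read bases {query_length(cigar)}")
```

For SEQ_002, FLAG 16 has only the reverse-strand bit set, and 5M1D12M spans 18 reference bases while using 17 read bases, because the single deleted base exists on the reference but not in the read.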
Read left to right, this is the spine of nearly every genomics pipeline: FASTQ → (align to FASTA) → SAM/BAM → (call) → VCF. The rest of this module opens up each format byte by byte.
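The last stop on that spine, a VCF row, is simple enough to take apart by hand. The sketch below splits the sample record into its eight fixed columns and turns INFO into a dictionary. Real VCF files are tab-delimited (the sample is shown with spaces), so the line is rebuilt with tabs here. `parse_vcf_line` and `parse_info` are hypothetical helper names, and INFO values are left as strings because their types are declared in the file's header, which this sketch does not read.

```python
def parse_info(info: str) -> dict:
    """Turn 'DP=54;AF=0.5' into {'DP': '54', 'AF': '0.5'}.

    Keys without '=' are flags and are stored as True.
    """
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of one tab-delimited VCF data line."""
    chrom, pos, vid, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),                      # 1-based position
        "id": None if vid == "." else vid,
        "ref": ref,
        "alt": alt.split(","),                # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": parse_info(info),
    }


# The sample VCF record from this module, rebuilt with tab separators.
sample = "\t".join(["chr22", "17280632", ".", "G", "A", "982", "PASS", "DP=54;AF=0.5"])
record = parse_vcf_line(sample)
print(record)
```

A production pipeline would normally use a dedicated VCF library rather than string splitting. The point here is only to show that each column named in the header line maps to one field.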
The notebook — live & editable
Every section's code is already filled in below. Press the ▶ next to any cell (or Shift+Enter) to run it, edit it and run again, or hit Run all to execute the whole notebook top to bottom. No Python or Jupyter install is needed: the kernel boots right here in your browser.