What is the FASTA format?

Modified on Fri, 2 Oct, 2020 at 5:14 PM

The FASTA format is a text-based format for representing nucleotide sequences or amino acid (protein) sequences, in which each nucleotide or amino acid is represented by a single-letter code. Each sequence is preceded by a description line that begins with '>' and gives the sequence a name or unique identifier. The description line may also contain additional information, such as comments.

The example below contains three sequences, each with its own identifier and description.

>sp|J7RUA5|CAS9_STAAU Start of CRISPR-associated endonuclease Cas9 OS=Staphylococcus aureus
MKRNYILGLDIGITSVGYGIIDYETRDVIDAGVRLFKEANVENNEGRRSKRGARRLKRRR
RHRIQRVKKLLFDYNLLTDHSELSGINPYEARVKGLSQKLSEEEFSAALLHLAKRRGVHN
VNEVEEDTGNELS
>sp|Q99ZW2|CAS9_STRP1 Start of CRISPR-associated endonuclease Cas9/Csn1 OS=Streptococcus pyogenes
MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAE
ATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFG
NIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSD
VDKLFIQLVQT
>sp|G3ECR1|CAS9_STRTR Start of CRISPR-associated endonuclease Cas9 OS=Streptococcus thermophilus
MLFNKCIIISINLDFSNKEKCMTKPYSIGLDIGTNSVGWAVITDNYKVPSKKMKVLGNTS
KKYIKKNLLGVLLFDSGITAEGRRLKRTARRRYTRRRNRILYLQEIFSTEMATLDDAFFQ
RLDDSFLVPDDKRDSKYPIF
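
To make the structure of the example above concrete, here is a minimal Python sketch of how such a file could be read. It is an illustration, not a reference implementation: the names FastaRecord and parse_fasta are hypothetical, and the sketch assumes that the identifier is the text between '>' and the first whitespace, that the rest of the description line is free-form information, and that blank lines can be ignored.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastaRecord(NamedTuple):
    identifier: str   # assumed: text between '>' and the first whitespace
    description: str  # rest of the description line (may be empty)
    sequence: str     # all sequence lines of the record joined together


def _make_record(header: str, chunks: List[str]) -> FastaRecord:
    # Split the description line once, on the first run of whitespace.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaRecord(identifier, description, "".join(chunks))


def parse_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield one FastaRecord for each '>' description line in `lines`."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
        # Sequence text before the first '>' is ignored in this sketch.
    if header is not None:
        yield _make_record(header, chunks)
```

Because parse_fasta accepts any iterable of lines, it can be given an open file object directly, for example `for record in parse_fasta(open("cas9.fasta")): ...`, where cas9.fasta is a placeholder file name.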

An identifier should contain only alphanumeric characters, underscores (_) and hyphens (-). Do not put spaces in an identifier.
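
The identifier rule can be checked programmatically. The following Python sketch applies the rule literally; the function name is_valid_identifier is hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits and that an empty identifier is not allowed.

```python
import re

# Assumption: only ASCII letters, digits, '_' and '-' are accepted,
# and the identifier must contain at least one character.
_IDENTIFIER_RE = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if `identifier` uses only the permitted characters."""
    return _IDENTIFIER_RE.fullmatch(identifier) is not None
```

Under this literal reading, is_valid_identifier("CAS9_STAAU") is accepted, while "my sequence" is rejected because of the space. An identifier that contains '|' characters, such as sp|J7RUA5|CAS9_STAAU, would also be rejected.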