Vocabulary for the BioSequence module

Modified on Wed, 13 Nov, 2024 at 11:26 AM

Alignment Identity %: number of matches over alignment length. In our example: 63/(63+49)*100 ≈ 56.2%
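The calculation can be sketched in Python as follows. The function name `alignment_identity` and the toy aligned strings are hypothetical illustrations, and `-` is assumed to mark a gap; this is not the module's actual code.

```python
def alignment_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns where both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the residues are equal and neither is a gap.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


# Toy alignment: 6 matching columns out of 8 (one mismatch, one gap).
print(alignment_identity("ACGTAC-T", "ACGAACGT"))
```

The denominator is the full alignment length (matches, mismatches, and gap columns together), which is why this percentage can differ from the query and subject identity percentages.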

BLAST: acronym for Basic Local Alignment Search Tool. BLAST finds regions of similarity between biological sequences. The program compares your nucleotide or protein sequences (query) to sequences within a database (subject) and calculates the statistical significance of the matches. More details here.

E-value: The BLAST E-value (Expectation value or Expect value) is the number of alignments with a score at least as good as yours that would be expected to occur just by chance in a database of the same size. The lower the E-value, the more significant the match. In practice, the E-value tends to be extremely high (potentially in the millions) for very short sequences, and very low (think 10^-22 or even 0) for long, well-matched sequences.
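To illustrate why sequence length matters, here is a minimal sketch based on the standard relation between bit score and E-value, E = m × n × 2^(-S'), where m is the query length, n is the total database length, and S' is the bit score. The function name, the database size, and the bit scores are hypothetical values chosen for illustration; BLAST itself uses effective (corrected) lengths, so real values will differ.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


database_length = 1_000_000_000  # hypothetical database size in residues

# A short query cannot reach a high bit score, so its E-value stays large.
print(expect_value(bit_score=20.0, query_length=12, database_length=database_length))

# A long, well-matched query reaches a high bit score and a tiny E-value.
print(expect_value(bit_score=200.0, query_length=500, database_length=database_length))
```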

FASTA Format: text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. More details here.
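As an illustration of the format, the sketch below reads FASTA text into (header, sequence) pairs. The function name `read_fasta` and the two example records are hypothetical; the only format rules assumed are that a header line starts with `>` and that the sequence may span several lines.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 hypothetical nucleotide record
ATAGAGATGAGAGAT
GTATAGAGA
>seq2 hypothetical protein record
MKTAYIAKQR
"""

for name, sequence in read_fasta(example.splitlines()):
    print(name, len(sequence))
```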

Gap: A space introduced into a query or subject sequence to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introducing a gap deducts a fixed amount (the gap score) from the alignment score. Extending the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

HSP: A High-scoring Segment Pair (HSP) is a local alignment that achieves one of the highest alignment scores in a given search. Aligning two sequences can result in several HSPs instead of one alignment with hundreds of gaps or mismatches.

Motif: A motif or pattern is a way to describe a sequence match. For instance, ATAGAGATGAGAGAT[GA]TATAGAGA is a motif describing a sequence in which one position can be either a G or an A. Motif searching can be used for finding exact sequences, SNPs, protein motifs, or specific mutations. More about the motif search here.
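The example motif can be tried out with a regular expression, as in the sketch below. This assumes the bracket notation behaves like a regular-expression character class; the exact syntax accepted by the motif search may differ. The three test sequences are hypothetical.

```python
import re

# The motif from the definition above; [GA] accepts either G or A at that position.
motif = re.compile(r"ATAGAGATGAGAGAT[GA]TATAGAGA")

# Hypothetical sequences used only for illustration.
sequences = {
    "variant_G": "CCATAGAGATGAGAGATGTATAGAGACC",
    "variant_A": "CCATAGAGATGAGAGATATATAGAGACC",
    "variant_C": "CCATAGAGATGAGAGATCTATAGAGACC",
}

for name, sequence in sequences.items():
    hit = motif.search(sequence)
    # Report the 1-based start position when the motif is found.
    print(name, hit.start() + 1 if hit else "no match")
```

The G and A variants match, while the C variant does not, because C is not listed inside the brackets.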

Query Identity %: number of query matches over query length. In our example: 63/107*100 ≈ 58.9%. Available in the filters, like the other percentages below.

Query Coverage %: the length of the aligned query divided by the length of the query. In our example: (107-1+1)/107*100 = 100%

Query sequence: corresponds to the input sequence, “your” sequence, the sequence given to the tool for searching.

SQID: our internal numbering of the sequences in our database. Each sequence is unique and has a specific SQID. If two different families contain the exact same sequence, the same SQID (and thus color) will be shown in the XLS export. The coloring helps you review your results and group them by SQID, for example.

Subject Coverage %: the length of the aligned subject divided by the length of the subject. In our example: (224-114+1)/225*100 = 49.3%

Subject Identity %: number of subject matches over subject length. In our example: 63/225*100 = 28%
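The query and subject percentages all follow the same two formulas, which can be sketched as below using the figures from the examples in this list. The function names `coverage_pct` and `identity_pct` are hypothetical, and alignment coordinates are assumed to be 1-based and inclusive, as the "+1" in the examples suggests.

```python
def coverage_pct(start: int, end: int, sequence_length: int) -> float:
    """Percentage of a sequence spanned by the alignment (1-based, inclusive)."""
    return 100.0 * (end - start + 1) / sequence_length


def identity_pct(matches: int, sequence_length: int) -> float:
    """Percentage of a sequence's positions that are matched in the alignment."""
    return 100.0 * matches / sequence_length


matches = 63
query_length, subject_length = 107, 225

print(f"Query identity:   {identity_pct(matches, query_length):.1f}%")
print(f"Query coverage:   {coverage_pct(1, 107, query_length):.1f}%")
print(f"Subject identity: {identity_pct(matches, subject_length):.1f}%")
print(f"Subject coverage: {coverage_pct(114, 224, subject_length):.1f}%")
```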

Subject sequence: corresponds to a sequence within the database, the “hit” sequence.

Word size: The minimum length of a perfect match between two sequences. This value is especially important for very short sequences.
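The effect of the word size can be illustrated with the sketch below, which lists the exact words of a given length shared by two sequences. It is a simplified picture of the idea, not BLAST's actual seeding procedure; the function name `shared_words` and both sequences are hypothetical.

```python
def shared_words(query: str, subject: str, word_size: int) -> set:
    """Return every substring of length word_size found in both sequences."""
    if word_size < 1:
        raise ValueError("word_size must be at least 1")

    def words(sequence: str) -> set:
        # All overlapping substrings of length word_size.
        return {
            sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)
        }

    return words(query) & words(subject)


query = "ACGTTGCA"      # hypothetical short query
subject = "TTACGTTAAC"  # hypothetical database sequence

print(shared_words(query, subject, 5))  # the 5-letter word ACGTT is shared
print(shared_words(query, subject, 7))  # no 7-letter word is shared
```

With a short query, a word size that is too large leaves no shared word at all, so no hit can be reported.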

Open/Extend gap penalty: The penalties deducted from the score for opening a gap (the first gapped position) and for each position that extends it.

Match/Mismatch cost: For nucleotide comparisons only, the score increase for a match between two identical nucleotides and the score decrease when they do not match.
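The sketch below shows how match/mismatch costs and gap penalties could combine into an alignment score. The function name, the parameter values (match +2, mismatch -3, gap open 5, gap extend 2), and the toy alignment are assumptions chosen for illustration, and a gap of length k is assumed to cost open + k × extend; check the tool's own settings for the values actually in use.

```python
def score_alignment(aligned_query: str, aligned_subject: str,
                    match: int = 2, mismatch: int = -3,
                    gap_open: int = 5, gap_extend: int = 2) -> int:
    """Score a nucleotide alignment; '-' marks a gap position."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap = None  # which sequence held a gap in the previous column
    for q, s in zip(aligned_query, aligned_subject):
        if q == "-" and s == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        current_gap = "query" if q == "-" else "subject" if s == "-" else None
        if current_gap is None:
            score += match if q == s else mismatch
        else:
            # Opening a new gap costs gap_open once; every gapped
            # position (including the first) also costs gap_extend.
            if current_gap != previous_gap:
                score -= gap_open
            score -= gap_extend
        previous_gap = current_gap
    return score


# 7 matches, 1 mismatch, and one gap of length 2 in the query.
print(score_alignment("ACGT--ACGT", "ACGTTTACCT"))
```

With these assumed values the example scores 7 × 2 - 3 - (5 + 2 × 2) = 2, which shows how a single two-position gap can outweigh several matches.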