How to read BioSequence results?

Modified on Wed, 13 Nov, 2024 at 11:28 AM

TABLE OF CONTENTS

Check your results and the searched sequences
If you have no results, check the way you searched
Refine your results with the Identity filters
The good percentage (%) identities is…
Export hit value fields are?
Other filters to consider
Learn more about OBS specific columns and display
The alignment tab: a powerful display of sequences

Want to simply read the alignments? Check this dedicated article: How to read your alignments

Check your results and the searched sequences

When launching an OBS run, you set a maximum number of unique sequences to return. Since you cannot know a priori how many results to expect, you can leave the default or raise it a bit to, say, 500. Once the results are computed, check whether that number was high enough: go to the last page and look at the worst results. If those are below what you would consider good, you have all the results you want. If the worst results are still good enough, more relevant hits may have been cut off, so re-launch your OBS run with a higher maximum number of unique sequence results.
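The check described above can be sketched in a few lines. This is only an illustration: the function `needs_rerun`, its inputs and the threshold values are hypothetical and not part of OBS. It also assumes that a run returning fewer sequences than the maximum was not truncated.

```python
def needs_rerun(hit_identities, good_enough, max_unique_sequences):
    """Return True if the result cap may have cut off acceptable hits.

    hit_identities: % identity of each unique sequence returned (hypothetical input).
    good_enough: the lowest % identity you still consider a useful hit.
    max_unique_sequences: the maximum set when launching the run.
    """
    if not hit_identities:
        # No results at all: the cap is not the problem.
        return False
    if len(hit_identities) < max_unique_sequences:
        # Assumption: the cap was not reached, so nothing was truncated.
        return False
    # The cap was reached: if even the worst hit is still acceptable,
    # more acceptable hits may exist beyond the cap.
    return min(hit_identities) >= good_enough


# The worst of 3 results (81%) is still above 80%, and the cap of 3 was reached.
rerun = needs_rerun([95.0, 88.0, 81.0], good_enough=80.0, max_unique_sequences=3)
```

Here `rerun` is `True`, which corresponds to the case where the run should be re-launched with a higher maximum.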

More at How to search for a sequence

If you have no results, check the way you searched

That is always worrisome, indeed. Do I have 0 results because my sequence or parameters are not correct, or are there really 0 results (which is usually good news for you!)? If you have done a MOTIF search, there is a real possibility of getting 0 results because your motif might be too restrictive. You might want to loosen it a bit to verify that you are not mistaken. If you have done a BLAST search with a very short sequence, the set-up might be the cause. For very short DNA sequences (such as a 10-residue primer) or a particularly short CDR (5 amino acids long, for instance), you need to select the “short input sequences” flag and increase the Max E-value to its maximum (5M). Those are common mistakes. When in doubt, our wonderful support team is always there to help.

More at Biosequence: MOTIF searching

Refine your results with the Identity filters

Let us first define what percentage identity means. A percentage identity is always computed over something: the query, the subject or the alignment. So, the percentage identity over the query is the percentage of query residues that match, and similarly for the subject and the alignment. Note that this might look a bit strange with gaps. For instance, if all residues of the query match but gaps are introduced in the query sequence, 100% of the query residues still match. Thus, a 100% query identity does not mean that a perfectly matching sequence was found.
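To make the three definitions concrete, here is a minimal sketch that applies the formulas listed in the export field table (100 times the number of matching residues divided by the query, subject or alignment length). The function name `identities` and the use of `-` to mark a gap are assumptions made for this illustration, not an OBS interface.

```python
def identities(aligned_query, aligned_subject, query_length, subject_length):
    """Compute % identity over the query, the subject and the alignment.

    aligned_query / aligned_subject: the two rows of one alignment, of equal
    length, with '-' marking a gap (assumed representation).
    query_length / subject_length: full lengths of the original sequences.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have the same length")
    # A position counts as a match when both rows carry the same residue.
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    alignment_length = len(aligned_query)
    return {
        "%query": 100 * matches / query_length,
        "%subject": 100 * matches / subject_length,
        "%alignment": 100 * matches / alignment_length,
    }


# The 5-residue query ACDEF matches fully, but a gap is opened in the query row.
result = identities("ACD-EF", "ACDWEF", query_length=5, subject_length=6)
```

In this example `result["%query"]` is 100.0 while `%subject` and `%alignment` are both about 83.3, which shows why a 100% query identity is not the same as a perfect match.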

So, which one should you use? Let’s say you want to find similar subject sequences to your query or queries. In this case, you can set both %identity over the query and subject to a high value, say, 80. That will guarantee that all your hits will be very close, i.e. the queries and subjects will be very similar to each other. Another case is when you have one or more short queries and you want them embedded in subject sequences (think CDRs and chains). Here you will want to set only the %query identity.
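The two use cases can be sketched as two filters over a list of hits. The hit records, the key names `pct_query` and `pct_subject`, and the function names are hypothetical; only the threshold of 80 comes from the text above.

```python
# Hypothetical hits: one close homolog, one short query embedded in a long
# subject, and one hit that covers the subject well but not the query.
hits = [
    {"name": "hit A", "pct_query": 95.0, "pct_subject": 92.0},
    {"name": "hit B", "pct_query": 100.0, "pct_subject": 4.0},
    {"name": "hit C", "pct_query": 60.0, "pct_subject": 85.0},
]


def similar_sequences(hits, threshold=80.0):
    """Keep hits where query and subject are very similar to each other."""
    return [
        h for h in hits
        if h["pct_query"] >= threshold and h["pct_subject"] >= threshold
    ]


def embedded_queries(hits, threshold=80.0):
    """Keep hits where the (short) query is found inside the subject."""
    return [h for h in hits if h["pct_query"] >= threshold]


similar = [h["name"] for h in similar_sequences(hits)]
embedded = [h["name"] for h in embedded_queries(hits)]
```

With this data, `similar` contains only "hit A", while `embedded` contains "hit A" and "hit B": the CDR-like query of hit B is fully matched even though it represents a tiny part of the subject.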

The good percentage (%) identities is…

Unfortunately, and this gets asked often, there is no good answer to this question. Some patents claim sequences with 70% identity, some with 90%. If the sequences are very short, the number of mismatches or some specific substitutions are mentioned or claimed instead. Roughly speaking, 80% identity over the query or subject is generally considered a good percentage identity, but again, this might vary from case to case.

Export hit value fields are?

Please find below a table with all the value fields:

SQID	A unique number for a unique sequence. The cell is color-coded such that 2 identical SQIDs (i.e. sequences) have the same color.
Query with hits	The full name of the query sequence
PN-SEQID	The patent number - sequence number (SEQ ID NO)
Claimed in	If not empty, it lists the claims in which the SEQ ID NO is mentioned
Organism	If available in the ST25/ST26 sequence listing, the Organism field is listed
Features	If available, any extra feature from the ST25/ST26 is shown
%query	The percentage identity over the query, computed as: 100 times the number of matching residues divided by the query length
%subject	The percentage identity over the subject, computed as: 100 times the number of matching residues divided by the subject length
%alignment	The percentage identity over the alignment, computed as: 100 times the number of matching residues divided by the alignment length
Coverage query	The percentage of the Query covered by the alignment
Coverage subject	The percentage of the Subject covered by the alignment
E-value	The expect value
Blast score	The Blast score
Frame/strand query	The frame for a nucleotide sequence translated into amino acids. Values are -3, -2, -1, +1, +2, +3. For an untranslated nucleotide sequence, the strand can be FOR (forward) or REV (reverse complement). A protein sequence is always FOR.
Frame/strand subject	The frame for a nucleotide sequence translated into amino acids. Values are -3, -2, -1, +1, +2, +3. For an untranslated nucleotide sequence, the strand can be FOR (forward) or REV (reverse complement). A protein sequence is always FOR.
Original query from	Before any translation or reverse complement, the beginning of the alignment for the query
Original query to	Before any translation or reverse complement, the end of the alignment for the query
Original subject from	Before any translation or reverse complement, the beginning of the alignment for the subject
Original subject to	Before any translation or reverse complement, the end of the alignment for the subject
In frame/strand query from	After translation or reverse complement, the beginning of the alignment for the query
In frame/strand query to	After translation or reverse complement, the end of the alignment for the query
In frame/strand subject from	After translation or reverse complement, the beginning of the alignment for the subject
In frame/strand subject to	After translation or reverse complement, the end of the alignment for the subject
Number of gaps	The number of gaps (query or subject) in the alignment
Number of errors	The number of gaps (query or subject) and mismatches in the alignment
Alignment size	The size of the alignment
Subject size	The subject size
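As an illustration of how some of the fields above relate to an alignment, the sketch below derives them from two aligned rows. It rests on several assumptions that the table does not confirm: gaps are written as `-`, every gap position counts as one gap, coordinates are 1-based and inclusive, and coverage is the aligned span divided by the query length. The function name `alignment_stats` is hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject, query_from, query_to, query_length):
    """Derive a few export-style fields from one alignment (illustrative only)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have the same length")
    # Assumption: each '-' position in either row counts as one gap.
    gaps = aligned_query.count("-") + aligned_subject.count("-")
    # A mismatch is a position where both rows hold different residues.
    mismatches = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q != s and "-" not in (q, s)
    )
    return {
        "Number of gaps": gaps,
        "Number of errors": gaps + mismatches,
        "Alignment size": len(aligned_query),
        # Assumption: 1-based inclusive coordinates on the original query.
        "Coverage query": 100 * (query_to - query_from + 1) / query_length,
    }


# Residues 2 to 8 of a 10-residue query, aligned with one gap and one mismatch.
stats = alignment_stats("ACD-EFGH", "ACDWEYGH", query_from=2, query_to=8, query_length=10)
```

For this alignment the sketch gives 1 gap, 2 errors, an alignment size of 8 and a query coverage of 70%.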

Other filters to consider

Here is a list of other filters and their uses:

• Number of errors

o This is the number of errors in the alignment, errors being mismatches and gaps. It can be used to control the quality of the alignment with more finesse.

• Number of gaps

o When one wants to separate gapped alignments from non-gapped alignments.

• Limit to claims

o Only claimed sequences will be shown as hits.

• Query name

o When using several query sequences, selecting all or some query names will show only families where all the selected queries have hits. This is particularly useful when one wants to find families that have hits with all one’s CDRs.

• Subject length and alignment length

o Those filters apply to the length of the subject or alignment. One uses those filters to control for long subjects (think genomic subsequences) or very long alignments which can occur with MOTIF searching.

• Organism

o We normalize organism names linked to patent sequences up to a certain point. Though not a perfect system, it allows for finer control over aligned sequences.

More at Biosequence specific filters
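The behaviour of the Query name filter can be pictured with a small sketch: a family is kept only if every selected query has at least one hit in it. The family identifiers, query names and the function `families_with_all_queries` are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def families_with_all_queries(hits, selected_queries):
    """Return the families in which every selected query has at least one hit.

    hits: iterable of (family_id, query_name) pairs (assumed representation).
    """
    queries_by_family = defaultdict(set)
    for family_id, query_name in hits:
        queries_by_family[family_id].add(query_name)
    wanted = set(selected_queries)
    # Keep a family only if the selected queries are a subset of its queries.
    return sorted(
        family for family, queries in queries_by_family.items()
        if wanted <= queries
    )


hits = [
    ("FAM1", "CDR-H1"), ("FAM1", "CDR-H2"), ("FAM1", "CDR-H3"),
    ("FAM2", "CDR-H1"), ("FAM2", "CDR-H3"),
    ("FAM3", "CDR-H2"),
]
all_three = families_with_all_queries(hits, ["CDR-H1", "CDR-H2", "CDR-H3"])
two_only = families_with_all_queries(hits, ["CDR-H1", "CDR-H3"])
```

Selecting all three CDRs keeps only FAM1, while selecting CDR-H1 and CDR-H3 keeps FAM1 and FAM2.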

Learn more about OBS specific columns and display

There are 7 OBS specific columns that can be shown next to the FAMPAT columns (Title, Assignee, …). The “display” menu (next to the printer symbol) above the family rows controls which columns are displayed. Note that those numbers are only computed once when you open your results. Any subsequent filtering will not change those static numbers.

• Best %QID

o The largest percentage identity over the query for this family

• Claimed seq.

o Yes or no: Is any hit subject sequence claimed?

• Unique seq. hits

o The number of unique sequences that are hits

• Longest Alignment

• Nb queries w/ hits

o If you use several queries, the number of queries that have hits in this family, otherwise it is 1.

• Nb pub. w/ hits

o Number of different publications of the family that have hits.

• List of queries w/ hits

o The names of the queries that have hits in the family.
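As a rough picture of what these columns summarize, the sketch below computes them for one family from a list of hits. The hit records and their keys are hypothetical, and "Longest Alignment" is left out because it is not described above. Since the real values are computed only once when the results are opened, this corresponds to running the function on the unfiltered hits.

```python
def family_columns(hits):
    """Summarize the hits of one family (illustrative, not the OBS implementation).

    hits: list of dicts with the hypothetical keys 'sqid', 'query',
    'publication', 'pct_query' and 'claimed'.
    """
    queries = sorted({h["query"] for h in hits})
    return {
        "Best %QID": max(h["pct_query"] for h in hits),
        "Claimed seq.": "Yes" if any(h["claimed"] for h in hits) else "No",
        "Unique seq. hits": len({h["sqid"] for h in hits}),
        "Nb queries w/ hits": len(queries),
        "Nb pub. w/ hits": len({h["publication"] for h in hits}),
        "List of queries w/ hits": queries,
    }


hits = [
    {"sqid": 101, "query": "CDR-H1", "publication": "PUB-A", "pct_query": 100.0, "claimed": False},
    {"sqid": 101, "query": "CDR-H1", "publication": "PUB-B", "pct_query": 100.0, "claimed": True},
    {"sqid": 202, "query": "CDR-H3", "publication": "PUB-A", "pct_query": 87.5, "claimed": False},
]
columns = family_columns(hits)
```

Here the same sequence (SQID 101) appears in two publications, so the family has 2 unique sequence hits, 2 publications with hits and 2 queries with hits.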

The alignment tab: a powerful display of sequences

The right-hand side alignment tab is dynamically recomputed depending on the filters you have used. It shows alignment information on the currently highlighted family.

It first shows the queries with hits, then the publications in this family, with the number of hits per publication and the total number of sequences known in each publication.

Following those headers, the query name is visible. You can click on the little triangle on its left to open and close all following alignments relating to the query; for instance, to see other query hits.

The alignment area starts with a “sequence” and a list of publications and SEQ ID NOs. This sequence is common, in this family, to all the combinations of publications and SEQ ID NOs listed. Clicking on the word “sequence” will pop up a little window with the raw sequence.

The alignments are shown in two forms: a graphical representation and, by clicking on “Details”, the traditional textual alignment. The graphical representation packs a lot of information into a compact view. It includes the query and subject sequence sizes, the start and stop of the alignment, the frames or strands (FW for forward, REV for reverse complement, -3 to +3 for forward and reverse complement frames), the number of errors and matches, and the BLAST scores and E-values. Note that the coordinates are always the original sequence coordinates, even when the sequence is used in a specific frame.

The detailed alignment is more textual and is followed by more alignment features (number of gaps, …) and some details on each published version, such as claim status and organism.