Biosequence Variant : search and use

Modified on Wed, 13 Nov, 2024 at 11:27 AM

TABLE OF CONTENTS

Graphical Variant Analysis introduction
Graphical Variant Analysis Input
Interface Introduction
Display
Export

Graphical Variant Analysis introduction

The idea behind variant analysis is to see how subject sequences vary from your query. The subject sequences must be quite similar to your query; otherwise there would be so many variations that the analysis would become overwhelming. The analysis therefore starts from alignments. These come from Orbit BioSequence results and can (and should) be filtered to keep only the most interesting alignments and sequences.
When you are happy with the remaining results, click on the analysis button at the bottom of the main window.

Graphical Variant Analysis Input

Each query sequence is shown at the top. Click on any of them to see its input status.
The input status is composed of 3 parts. The first part is the query, where you can click on “view details” to see your query sequence. The second part tells you how many unique sequences and families are involved. The last part is informational and describes the selection of unique sequences.
Note that for now, you can only use 1000 unique sequences at a time. This value might change depending on feedback and usage.

Interface Introduction

The main window you can see is the multiple alignment window. It is built from all the alignments of the subjects to the query.
Variations are defined as mismatches in the alignment and gaps in the subjects. Gaps in the query are not defined as variations but are shown with an orange | symbol. The reason behind this is that insertions in the query, if shown, would need to shift every other alignment. This would be very inconvenient and misleading.
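The counting rule above can be illustrated with a short Python sketch. It is not the tool's implementation: the function name, the "-" gap character and the gapped-string representation of an alignment are assumptions made for this example.

```python
def count_variations(query, subject, gap="-"):
    """Count variations in one gapped pairwise alignment.

    Assumption: both strings are the aligned (gapped) forms and have
    the same length; "-" marks a gap.
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    count = 0
    for q, s in zip(query, subject):
        if q == gap:
            # Gap in the query (insert in the subject): not a variation.
            continue
        if s == gap or s != q:
            # Gap in the subject or mismatch: one variation.
            count += 1
    return count


# One mismatch (K/R), one insert (ignored), one gap in the subject.
print(count_variations("MKT-LV", "MRTAL-"))  # 2
```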

Clicking on a sequence

Clicking on any sequence shows, on the left-hand side, the identifiers of the sequence (S-number, SQID), its length, alignment length, number of variations and alignment information.
The alignment information is shown in compact form, just the mismatches, gaps in subject and the list of inserts, i.e. gaps in query.
This is followed by the list of instances of this sequence. For each family where the sequence is present, the list of publications and SEQ ID Nos is shown.

Clicking on a variation bar

Clicking on a variation bar shows the list of different variations at that position, as well as the list of sequences. Clicking on a sequence in that list expands it to show the same information as when clicking on a sequence (see above).

Limit to sequences

This section allows you to remove sequences according to certain criteria. Notice that every subsection starts with a checkbox (❒) that can be ticked (✓). This allows you to switch each filter on and off quickly.

With at least 1 variation

Removes all sequences with no variations. Here, the term sequence means the alignment of the sequence: if the subject sequence aligns only partially, but the aligned part is 100% identical, it will be removed. Note also that subject sequence insertions are not counted as variations.

With up to X variations

Enter a value to remove sequences that have more variations than that value. If you set the value to 4, all sequences with more than 4 variations (that is, 5 or more) are removed.
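As an illustration only, the two filters above could be expressed as follows. The `AlignedSequence` class, its fields and the `limit_sequences` function are hypothetical names introduced for this sketch; they do not describe the product's internals.

```python
from dataclasses import dataclass


@dataclass
class AlignedSequence:
    sqid: str        # hypothetical identifier of a unique sequence
    variations: int  # number of variations in its alignment to the query


def limit_sequences(seqs, at_least_one=False, up_to=None):
    """Keep only the sequences that pass the enabled filters."""
    kept = []
    for seq in seqs:
        if at_least_one and seq.variations == 0:
            continue  # "With at least 1 variation" is switched on
        if up_to is not None and seq.variations > up_to:
            continue  # "With up to X variations" is switched on
        kept.append(seq)
    return kept


seqs = [AlignedSequence("a", 0), AlignedSequence("b", 3),
        AlignedSequence("c", 4), AlignedSequence("d", 5)]
# With both filters on and X = 4, only "b" and "c" remain.
print([s.sqid for s in limit_sequences(seqs, at_least_one=True, up_to=4)])
```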
With variations at one of those positions:

Display

This section provides filters that will change the look and feel of the display, but will not change the data.
Hide positions with no variation
This very powerful switch will remove the positions where no variations were found, thus compacting the display. It is particularly useful when there are few variations or only variations at the end for instance.
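The effect of this switch can be sketched in Python as below. The sketch assumes, for simplicity, that every subject row is given in query coordinates, has the same length as the query and uses "-" for a gap in the subject; the function names are illustrative.

```python
def variable_positions(query, subjects):
    """Return the 1-based query positions where at least one subject differs."""
    keep = []
    for i, q in enumerate(query):
        if any(row[i] != q for row in subjects):
            keep.append(i + 1)
    return keep


def hide_invariant_positions(query, subjects):
    """Drop every column in which all subjects match the query."""
    cols = [p - 1 for p in variable_positions(query, subjects)]

    def pick(row):
        return "".join(row[i] for i in cols)

    return pick(query), [pick(row) for row in subjects]


query = "MKTLV"
subjects = ["MRTLV", "MKTL-", "MKTLV"]
print(variable_positions(query, subjects))        # [2, 5]
print(hide_invariant_positions(query, subjects))  # ('KV', ['RV', 'K-', 'KV'])
```

Only the display is compacted in this sketch; the input rows are left unchanged, in line with the statement that the Display filters do not change the data.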
Flag specific positions …
This simply colors the specified positions green.
Positions with most variations
Shows the top 5 positions in terms of number of variations, and colors in dark blue all positions that have one of the top 5 numbers of variations.
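One possible reading of this ranking is sketched below in Python. It assumes that the number of variations at a position is the number of subject rows that differ from the query there, and that ties are kept, so more than five positions can be flagged. Both points are assumptions of this sketch, as are the function name and the row representation (query coordinates, "-" for a gap in the subject).

```python
from collections import Counter


def top_variation_positions(query, subjects, n=5):
    """Map 1-based positions to variation counts, keeping only the
    positions whose count is among the n highest distinct counts."""
    counts = Counter()
    for i, q in enumerate(query):
        differing = sum(1 for row in subjects if row[i] != q)
        if differing:
            counts[i + 1] = differing
    top_counts = sorted(set(counts.values()), reverse=True)[:n]
    return {pos: c for pos, c in sorted(counts.items()) if c in top_counts}


query = "MKTLV"
subjects = ["MRTLA", "MRTL-", "MRSLV", "MKTLV"]
# n=2 is used here only to keep the example small.
print(top_variation_positions(query, subjects, n=2))  # {2: 3, 5: 2}
```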

Export

The export button creates an XLSX multi-sheet spreadsheet of the current view. Here is a description of the different sheets.

Audit trail : This sheet contains the Audit Trail. It contains the details of your OBS run, your OBS result filters as well as the variant filters.

Variant positional table : This sheet is based on the positions with variations. For each such position it lists the variations, the sequences, the family representative, the publication numbers and SEQ ID Nos. The SQID is an internal identifier for a unique sequence. It is useful to sort, search or group. The Variation is shown as a typical variation annotation in the form of D614G.
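To make the D614G style concrete: the label is the query residue, the position, then the subject residue. The Python sketch below groups substitutions by position in that form. The SQID values, the function name and the row representation are made up for the example, and gaps in the subject are skipped because their notation in the export is not described here.

```python
from collections import defaultdict


def positional_rows(query, subjects):
    """Group substitutions by position.

    subjects maps a (hypothetical) SQID to its aligned row, given in
    query coordinates with "-" for a gap in the subject.
    Returns a list of (position, label, [SQIDs]) sorted by position.
    """
    table = defaultdict(list)
    for sqid, row in subjects.items():
        for pos, (q, s) in enumerate(zip(query, row), start=1):
            if s != q and s != "-":
                # Label in the D614G style: query residue, position, subject residue.
                table[(pos, f"{q}{pos}{s}")].append(sqid)
    return [(pos, label, sqids) for (pos, label), sqids in sorted(table.items())]


subjects = {"SQ1": "MRTLV", "SQ2": "MRTLA", "SQ3": "MKTL-"}
for row in positional_rows("MKTLV", subjects):
    print(row)
# (2, 'K2R', ['SQ1', 'SQ2'])
# (5, 'V5A', ['SQ2'])
```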
Note that this sheet can be very big since there can be many lines for a single variation.

Family positional table : This is a more compact sheet. For each family (noted as the representative), it lists all the different sequences (as SQID), the associated publications and SEQ ID Nos and the variations. For instance, for the entry below:
There are 3 different sequences for this family. Each sequence is found through 2 instances in WO and EP publications. They have different variations shown in the last column.

Multiple alignment : The last sheet contains the multiple alignment as shown in the Graphical Variant Analysis window (though the “Hide positions with no variation” setting is not taken into account).
The order of sequences and their variation colors are the same too.