Biosequence: MOTIF searching

Modified on Fri, 23 Aug at 4:22 PM

  

You can search with a motif. This makes it straightforward to look for a group of CDRs, specific peptide patterns, or SNPs.


Motif syntax

The full syntax is as follows:

  • A letter matching itself (ambiguity characters are expanded)
  • . (dot) for any letter
  • ? for the previous entity 0 or 1 time
  • * for the previous entity 0 or more times
  • + for the previous entity 1 or more times
  • [ ] which contains a list of alternative letters
  • [^ ] which matches any letter except those listed after the ^
  • ( ) to group entities
  • (|) for alternatives, for instance (DYR|SYW)
  • {n} where n is a number: the previous entity matches exactly n times.
  • {n,m} where n and m are numbers: the previous entity matches at least n times and at most m times. Either n or m can be left empty, meaning no limit on that side. For instance, {1,5} means from 1 to 5 times.
  • ^ means the sequence must start with the motif: ^ATC must start with ATC
  • $ means the sequence must end with the motif: ATC$ must end with ATC
  • DNA and amino acid ambiguity characters are fully expanded. For instance, the DNA ambiguity character B (meaning all but A) is expanded to [BCGTU], T and U are both expanded to [TU], …
  • \X is a special case. In motif searches against proteins, it matches an X. This notation only works as \X, i.e. not with \P or \A.
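
The ambiguity expansion described in the list can be sketched in Python. This is an illustration only, not the search engine's actual implementation: it assumes the motif syntax behaves like Python's `re` syntax, it uses the standard IUPAC DNA codes, and it assumes every ambiguity letter expands the way the B example does (the letter itself plus its bases, with U added wherever T is allowed). The names `IUPAC_DNA`, `expand_letter` and `expand_dna_motif` are hypothetical.

```python
import re

# Standard IUPAC DNA ambiguity codes and the bases each one stands for.
IUPAC_DNA = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_letter(letter):
    """Return every letter that one DNA motif letter is assumed to match."""
    if letter in ("T", "U"):
        return "TU"
    bases = IUPAC_DNA.get(letter)
    if bases is None:
        return letter  # A, C, G and unknown letters match only themselves
    expanded = letter + bases
    if "T" in bases:
        expanded += "U"  # assumption: U is accepted wherever T is
    return expanded


def expand_dna_motif(motif):
    """Rewrite a DNA motif as a regular expression with ambiguity expanded."""
    out = []
    in_class = False  # True while inside [ ... ]
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "\\" and i + 1 < len(motif):
            out.append(motif[i:i + 2])  # keep escapes such as \X untouched
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch.isalpha():
            letters = expand_letter(ch)
            if in_class or len(letters) == 1:
                out.append(letters)  # already inside brackets, or unambiguous
            else:
                out.append("[" + letters + "]")
        else:
            out.append(ch)  # operators, digits and brackets pass through
        i += 1
    return "".join(out)


print(expand_dna_motif("ATB"))          # A[TU][BCGTU]
print(expand_dna_motif("^[AT]C{2,}"))   # ^[ATU]C{2,}
print(re.search(expand_dna_motif("ATB"), "GGAUCG").group())  # AUC
```

With this expansion, a motif written with T also finds U in an RNA sequence, which is the behaviour the [TU] example in the list describes.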


Search examples:

A simple motif with an alternative amino acid:

[EK]FWEVISDEHGIDPS

Three CDRs with any spacing in between:

SYWMY.*RIDPNSGSTKYNEKFKN.*DYRKGLYAMDY

Note that .* matches any stretch of residues, including none.

One of three alternative triplets at the start of the sequence (one of which contains a wildcard), and later a run of one to four H or W:

^(DYR|SYW|W.W)EVISDE[HW]{1,4}GID

An exact sequence (starts with ^, ends with $):

^RIDPNSGSTKYNEKFKN$
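
To try these motifs outside the search form, the sketch below runs them with Python's `re` module. It assumes the motif syntax behaves like standard regular expressions for protein sequences. The helper `matches` is hypothetical and the test sequences are invented for illustration.

```python
import re

# Motifs copied from the examples above.
ALTERNATIVE = r"[EK]FWEVISDEHGIDPS"
THREE_CDRS = r"SYWMY.*RIDPNSGSTKYNEKFKN.*DYRKGLYAMDY"
TRIPLET_START = r"^(DYR|SYW|W.W)EVISDE[HW]{1,4}GID"
EXACT = r"^RIDPNSGSTKYNEKFKN$"


def matches(motif, sequence):
    """Return True when the motif is found in the sequence."""
    return re.search(motif, sequence) is not None


# Either E or K is accepted at the first position, nothing else.
print(matches(ALTERNATIVE, "AAKFWEVISDEHGIDPSGG"))  # True
print(matches(ALTERNATIVE, "AAQFWEVISDEHGIDPSGG"))  # False

# .* accepts any spacing between the CDRs, including none at all.
print(matches(THREE_CDRS, "SYWMYRIDPNSGSTKYNEKFKNDYRKGLYAMDY"))  # True
print(matches(THREE_CDRS, "QVSYWMYWVKQRIDPNSGSTKYNEKFKNKATLDYRKGLYAMDYWG"))  # True

# ^ anchors the triplet at the start; [HW]{1,4} allows at most four H or W.
print(matches(TRIPLET_START, "WAWEVISDEHHWGIDPS"))  # True
print(matches(TRIPLET_START, "DYREVISDEHHHHHGID"))  # False (five H)
print(matches(TRIPLET_START, "AADYREVISDEHGID"))    # False (not at the start)

# ^ and $ together only accept the sequence on its own.
print(matches(EXACT, "RIDPNSGSTKYNEKFKN"))   # True
print(matches(EXACT, "ARIDPNSGSTKYNEKFKN"))  # False
```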

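A list of mutations (S24G, S33T, S53G, S78N, S101N, G128A and L217Q), written as three motifs that match the wild-type residues, the mutated residues, or either. Each .{n} skips the n residues between two positions of interest, so ^.{23}S places S at position 24: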

>motif_WT

^.{23}S.{8}S.{19}S.{24}S.{22}S.{26}G.{88}L

>motif_MUT

^.{23}G.{8}T.{19}G.{24}N.{22}N.{26}A.{88}Q

>motif_BOTH

^.{23}[SG].{8}[ST].{19}[SG].{24}[SN].{22}[SN].{26}[GA].{88}[LQ]
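
The gap sizes in these motifs follow from simple arithmetic: each .{n} skips the residues between one mutated position and the next, so n is the difference between the two positions minus one. The sketch below derives the three motifs from the mutation list. The function name `build_mutation_motifs` is hypothetical, and the code assumes mutations are written as wild-type letter, 1-based position, mutant letter.

```python
import re


def build_mutation_motifs(mutations):
    """Build (wild-type, mutant, either) motifs from mutations such as 'S24G'."""
    parsed = []
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError("unrecognised mutation: " + mutation)
        wild, position, mutant = match.groups()
        parsed.append((int(position), wild, mutant))
    parsed.sort()  # motifs are written left to right, so order by position

    wild_type, mutated, either = ["^"], ["^"], ["^"]
    previous = 0  # position of the last residue already written
    for position, wild, mutant in parsed:
        gap = position - previous - 1  # residues skipped before this position
        if gap < 0:
            raise ValueError("invalid or duplicate position: %d" % position)
        skip = ".{%d}" % gap if gap else ""
        wild_type.append(skip + wild)
        mutated.append(skip + mutant)
        either.append(skip + "[" + wild + mutant + "]")
        previous = position
    return "".join(wild_type), "".join(mutated), "".join(either)


wt, mut, both = build_mutation_motifs(
    ["S24G", "S33T", "S53G", "S78N", "S101N", "G128A", "L217Q"]
)
print(wt)    # ^.{23}S.{8}S.{19}S.{24}S.{22}S.{26}G.{88}L
print(mut)   # ^.{23}G.{8}T.{19}G.{24}N.{22}N.{26}A.{88}Q
print(both)  # ^.{23}[SG].{8}[ST].{19}[SG].{24}[SN].{22}[SN].{26}[GA].{88}[LQ]
```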


When is a motif search (unexpectedly) useful?

Here are a few cases where a motif search is useful that you might not have thought of:

  • Using an extremely short sequence. BLAST cannot use amino acid sequences shorter than 4 residues, so to search with a 3 amino acid sequence you need to use the motif search.
  • Looking for cases where your exact sequence is found. Use the motif search and add ^ at the beginning of your sequence and $ at the end.
  • Similarly, looking for cases where your sequence is found exactly or is included in a larger sequence. Use the query sequence as is.
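
The difference between the last two cases can be illustrated with a three-residue query. This is a sketch that assumes the motif syntax behaves like Python's `re` syntax. The helper `hits` is hypothetical and the peptide collection is invented for illustration.

```python
import re

# Invented peptide collection, for illustration only.
SEQUENCES = {
    "pep1": "RGD",
    "pep2": "AARGDSP",
    "pep3": "RGE",
}


def hits(motif, sequences):
    """Return the names of the sequences in which the motif is found."""
    return [name for name, seq in sequences.items() if re.search(motif, seq)]


# Query used as is: found exactly or inside a larger sequence.
print(hits("RGD", SEQUENCES))    # ['pep1', 'pep2']

# Query wrapped in ^ and $: only the sequence that is exactly RGD.
print(hits("^RGD$", SEQUENCES))  # ['pep1']
```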
