String Kernel Motifs Tool


SKM (String Kernel Motifs) is a stand-alone tool for determining whether a sequence shares motifs with a family of sequences. It does this using string kernels within a support vector machine, an efficient classification algorithm. String kernels have the added advantage that no multiple alignment is required. Variants of string kernels have been applied in a wide range of areas of computational biology (as can be seen here), including protein-protein interactions, protein family classification, the prediction of transcriptional regulatory modules, HIV-1 coreceptors, siRNA silencing efficacy, miRNA precursors, binding peptides for MHC class II molecules and bacterial transcription start sites.

The emphasis of SKM is to provide a package for non-computational biologists, with a simple interface that can be used on a variety of architectures (i.e. Windows or Unix-compliant operating systems). It can be run on either nucleic acid or protein sequences.

SKM provides two types of string kernel. The first is based on the frequency of all motifs of a particular length, which the user chooses. The second is the so-called mismatch kernel, where occurrences of motifs that differ by only one nucleotide (or amino acid) are binned together.
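
As a rough illustration of the two kernel types described above, the following Python sketch counts fixed-length motifs and then bins motifs that differ at one position. It is not SKM's own code: the function names, the default DNA alphabet and the dot-product kernel are assumptions made for this example.

```python
from collections import Counter

def spectrum_features(seq, k):
    """Count every length-k motif (k-mer) that occurs in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def mismatch_features(seq, k, alphabet="ACGT"):
    """Credit each observed k-mer to itself and to every k-mer that
    differs from it at exactly one position (hypothetical helper)."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] += 1
        for pos in range(k):
            for letter in alphabet:
                if letter != kmer[pos]:
                    counts[kmer[:pos] + letter + kmer[pos + 1:]] += 1
    return counts

def kernel(features_a, features_b):
    """Kernel value: dot product of two motif-count vectors."""
    return sum(n * features_b[motif] for motif, n in features_a.items())

# "AAA" and "AAT" share no exact 3-mer, so the frequency-based kernel
# sees no similarity, while the mismatch kernel does.
exact = kernel(spectrum_features("AAA", 3), spectrum_features("AAT", 3))
binned = kernel(mismatch_features("AAA", 3), mismatch_features("AAT", 3))
```

In this sketch the motif length k plays the role of the user-chosen length, and passing a 20-letter alphabet would adapt the mismatch version to protein sequences.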


Like all machine learning tools, SKM works in two phases: training and testing. In the training phase, SKM requires a true set of sequences from the family that the user wishes to check for and a false set to compare against. These are used to generate a classifier whose parameters can be stored and used at a later time. SKM will, if necessary, generate the false set by randomly shuffling the true sequences. However, as pointed out by Ben-Hur and Stafford Noble, such choices can introduce biases, and it is better to select biologically relevant false cases; for example, when classifying transcription factor binding sites, random upstream regions in the human genome could be used. In the test phase, the trained classifier is applied to query sequences to see whether they belong to the family.
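
The shuffling approach to building a false set can be sketched as follows. This is a minimal Python illustration under assumed names (shuffled_negatives and the seed parameter are hypothetical), not a description of how SKM itself performs the shuffle.

```python
import random

def shuffled_negatives(true_seqs, seed=0):
    """Build a false set by shuffling the letters of each true sequence.

    Each shuffled sequence keeps the length and letter composition of
    its source but loses the original order of the letters.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    negatives = []
    for seq in true_seqs:
        letters = list(seq)
        rng.shuffle(letters)
        negatives.append("".join(letters))
    return negatives

true_set = ["ACGTTGCA", "TTGACGTA", "GGCATCGA"]
false_set = shuffled_negatives(true_set)
```

Because every shuffled sequence has exactly the same composition as the true sequence it came from, a false set built this way differs from the true set only in letter order, which is one way to see why biologically relevant false cases may be the better choice.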

What SKM is not:

SKM is not a motif search algorithm, in that it does not try to find over-represented motifs (string kernels can detect motifs that occur in the same sequence but may not be adjacent to each other). Likewise, it does not give a visualisation of the motifs that occur.