Introduction:
SKM (String Kernel Motifs) is a stand-alone tool for determining
whether a sequence shares motifs with a family of sequences. It
does this using a technique called string kernels, implemented
in a support vector machine, which is a very efficient classification
algorithm. String kernels have the added advantage that it is not
necessary to perform a multiple alignment. Different variants of string
kernels have now been applied in a wide variety of areas of
computational biology (as can be seen
here), including
protein-protein interactions,
protein family classification, the prediction of
transcriptional regulatory modules,
HIV-1 coreceptors,
siRNA silencing efficacy,
miRNA precursors,
binding peptides for MHC II class molecules and
bacterial transcription start sites.
The emphasis for SKM is to provide a package for non-computational
biologists with a simple interface that can be used on a variety of
different architectures (i.e. Windows or Unix-compliant operating
systems). It can be run on
either nucleic acid or protein sequences.
SKM provides two types of string kernel. The first is based on
the frequency of all motifs of a particular length (the user chooses
that length). The second is the so-called mismatch kernel,
where occurrences of motifs that differ by only one nucleotide (or
amino acid) are binned together.
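The two kernel types can be illustrated with a minimal Python sketch.
This is not SKM's own code: the function names (spectrum_features,
mismatch_features, kernel) and the default nucleotide alphabet are
assumptions made for illustration only. The first function counts every
motif of length k; the second also credits each motif that differs from
an observed one at a single position, which is one common way of binning
one-mismatch motifs together.

```python
from collections import Counter


def spectrum_features(seq, k):
    """Count every contiguous motif of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbours(kmer, alphabet):
    """Yield kmer itself and every motif differing from it at exactly one position."""
    yield kmer
    for i, original in enumerate(kmer):
        for letter in alphabet:
            if letter != original:
                yield kmer[:i] + letter + kmer[i + 1:]


def mismatch_features(seq, k, alphabet="ACGT"):
    """Each observed motif adds one count to its own bin and to the bins
    of all motifs that differ from it by a single letter."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        for motif in neighbours(seq[i:i + k], alphabet):
            counts[motif] += 1
    return counts


def kernel(features_a, features_b):
    """Kernel value: inner product of two motif-count vectors."""
    return sum(count * features_b[motif] for motif, count in features_a.items())


if __name__ == "__main__":
    a = spectrum_features("ACGT", 2)
    b = spectrum_features("ACGA", 2)
    print(kernel(a, b))
```

A kernel value computed this way compares two sequences directly from
their motif counts, which is why no multiple alignment is required.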
Use:
As with all machine learning tools, there are two phases: training and
testing. In the training phase, SKM requires a true set of
sequences from the family that the user wishes to check for and
a false set to compare against. These are used to generate a
classifier whose parameters can be stored and used at a later time. SKM
will, if necessary, generate
the false set by randomly shuffling the true sequences, but as pointed
out by
Ben-Hur and Stafford Noble
such choices can introduce biases and it is better to select
biologically relevant false cases; for example, when classifying
transcription factor binding sites, one could pick random upstream
regions in the human genome. In the test phase, the resulting trained classifier
is used to test query sequences to see if they belong to the
family.
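Generating a false set by shuffling can be sketched as follows. This is
an illustrative assumption about how such shuffling might be done, not
SKM's actual procedure; the function name shuffled_negatives and the
seed parameter are hypothetical.

```python
import random


def shuffled_negatives(true_seqs, seed=0):
    """Build one false sequence per true sequence by permuting its letters.

    Each shuffled sequence keeps the length and letter composition of its
    source, but the order of the letters (and hence the motifs) is randomised.
    """
    rng = random.Random(seed)  # fixed seed so the false set is reproducible
    negatives = []
    for seq in true_seqs:
        letters = list(seq)
        rng.shuffle(letters)
        negatives.append("".join(letters))
    return negatives


if __name__ == "__main__":
    true_set = ["ACGTACGTGG", "TTGACAATAT"]
    for original, shuffled in zip(true_set, shuffled_negatives(true_set)):
        print(original, shuffled)
```

Because shuffled sequences share the composition of the true set, a
classifier trained on them learns only what distinguishes ordered from
disordered sequence, which is one reason biologically relevant false
cases are preferable.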
What SKM is not:
SKM is not a motif search algorithm in that it does not try to
find over-represented motifs (string kernels can detect motifs that
occur in the same sequence but may not be adjacent to each other).
Likewise, it does not give a visualisation of the
motifs that occur.
Hugh.Shanahan@rhul.ac.uk