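Online pattern search in a StringSet

Time: 2015-10-05 20:47:29

Tags: c++ bioinformatics string-search seqan

The SeqAn tutorial for Pattern Matching mentions that a StringSet can be used as either the haystack or the needle. When I try to use a StringSet as the haystack, as shown below,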

StringSet<Dna5String> seqs;

/* do stuff to load sequences into seqs */

Finder<StringSet<Dna5String> > finder(seqs);
Pattern<Dna5String, Simple> pattern(Dna5String("GAATTC"));

if (find(finder, pattern))
{
  std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder)
            << ")\t" << infix(finder) << std::endl;
} else
{
  std::cout << "No match!";
}

I get the error:

error: use of overloaded operator '==' is ambiguous (with operand types 'const seqan::String<..., seqan::Alloc<...> >' and 'const seqan::SimpleType<...>')

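Does anyone know how to do this correctly?

Using a Finder on a single Dna5String works fine. The tutorial does show how to do an offline search (i.e., with an index), but that is not what I want. I would rather not iterate over the StringSet manually if the Finder-Pattern machinery in SeqAn already handles it.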

1 answer:

Answer 0 (score: 1)

You can try this, which creates one Finder per sequence and searches each sequence in turn:

#include <iostream>
#include <seqan/sequence.h>  // CharString, ...
#include <seqan/find.h>
#include <seqan/stream.h>

using namespace seqan;

typedef Iterator<StringSet<Dna5String> >::Type TStringSetIterator;

int main(int, char const **)
{
    StringSet<Dna5String> seqs;
    Dna5String seq1 =
        "TAGGTTTTCCGAAAAGGTAGCAACTTTACGTGATCAAACCTCTGACGGGGTTTTCCCCGTCGAAATTGGGTG"
        "TTTCTTGTCTTGTTCTCACTTGGGGCATCTCCGTCAAGCCAAGAAAGTGCTCCCTGGATTCTGTTGCTAACG"
        "AGTCTCCTCTGCATTCCTGCTTGACTGATTGGGCGGACGGGGTGTCCACCTGACGCTGAGTATCGCCGTCAC"
        "GGTGCCACATGTCTTATCTATTCAGGGATCAGAATTCATTCAGGAAATCAGGAGATGCTACACTTGGGTTAT"
        "CGAAGCTCCTTCCAAGGCGTAGCAAGGGCGACTGAGCGCGTAAGCTCTAGATCTCCTCGTGTTGCAACTACA"
        "CGCGCGGGTCACTCGAAACACATAGTATGAACTTAACGACTGCTCGTACTGAACAATGCTGAGGCAGAAGAT"
        "CGCAGACCAGGCATCCCACTGCTTGAAAAAACTATNNNNCTACCCGCCTTTTTATTATCTCATCAGATCAAG";
    Dna5String seq2 =
        "ACCGACGATTAGCTTTGTCCGAGTTACAACGGTTCAATAATACAAAGGATGGCATAAACCCATTTGTGTGAA"
        "AGTGCCCATCACATTATGATTCTGTCTACTATGGTTAATTCCCAATATACTCTCGAAAAGAGGGTATGCTCC"
        "CACGGCCATTTACGTCACTAAAAGATAAGATTGCTCAAANNNNNNNNNACTGCCAACTTGCTGGTAGCTTCA"
        "GGGGTTGTCCACAGCGGGGGGTCGTATGCCTTTGTGGTATACCTTACTAGCCGCGCCATGGTGCCTAAGAAT"
        "GAAGTAAAACAATTGATGTGAGACTCGACAGCCAGGCTTCGCGCTAAGGACGCAAAGAAATTCCCTACATCA"
        "GACGGCCGCGNNNAACGATGCTATCGGTTAGGACATTGTGCCCTAGTATGTACATGCCTAATACAATTGGAT"
        "CAAACGTTATTCCCACACACGGGTAGAAGAACNNNNATTACCCGTAGGCACTCCCCGATTCAAGTAGCCGCG";

    clear(seqs);
    appendValue(seqs, seq1);
    appendValue(seqs, seq2);

    Pattern<Dna5String, Simple> pattern(Dna5String("GAATTC"));

    //For each sequence in seqs
    for (TStringSetIterator it = begin(seqs); it != end(seqs); ++it)
    {
        std::cout << *it << std::endl;
        //I create a finder for each sequence in seqs
        Finder<Dna5String> finder(*it);
        if (find(finder, pattern)){
            std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder)
                      << ")\t" << infix(finder) << std::endl;
        }else{
            std::cout << "No match!" << std::endl;
        }
    }
    return 0;
}

You get:

TAGGTTTTCCGAAAAGGTAGCAACTTTACGTGATCAAACCTCTGACGGGGTTTTCCCCGTCGAAATTGGGTGTTTCTTGTCTTGTTCTCACTTGGGGCATCTCCGTCAAGCCAAGAAAGTGCTCCCTGGATTCTGTTGCTAACGAGTCTCCTCTGCATTCCTGCTTGACTGATTGGGCGGACGGGGTGTCCACCTGACGCTGAGTATCGCCGTCACGGTGCCACATGTCTTATCTATTCAGGGATCAGAATTCATTCAGGAAATCAGGAGATGCTACACTTGGGTTATCGAAGCTCCTTCCAAGGCGTAGCAAGGGCGACTGAGCGCGTAAGCTCTAGATCTCCTCGTGTTGCAACTACACGCGCGGGTCACTCGAAACACATAGTATGAACTTAACGACTGCTCGTACTGAACAATGCTGAGGCAGAAGATCGCAGACCAGGCATCCCACTGCTTGAAAAAACTATNNNNCTACCCGCCTTTTTATTATCTCATCAGATCAAG
[247,253)   GAATTC
ACCGACGATTAGCTTTGTCCGAGTTACAACGGTTCAATAATACAAAGGATGGCATAAACCCATTTGTGTGAAAGTGCCCATCACATTATGATTCTGTCTACTATGGTTAATTCCCAATATACTCTCGAAAAGAGGGTATGCTCCCACGGCCATTTACGTCACTAAAAGATAAGATTGCTCAAANNNNNNNNNACTGCCAACTTGCTGGTAGCTTCAGGGGTTGTCCACAGCGGGGGGTCGTATGCCTTTGTGGTATACCTTACTAGCCGCGCCATGGTGCCTAAGAATGAAGTAAAACAATTGATGTGAGACTCGACAGCCAGGCTTCGCGCTAAGGACGCAAAGAAATTCCCTACATCAGACGGCCGCGNNNAACGATGCTATCGGTTAGGACATTGTGCCCTAGTATGTACATGCCTAATACAATTGGATCAAACGTTATTCCCACACACGGGTAGAAGAACNNNNATTACCCGTAGGCACTCCCCGATTCAAGTAGCCGCG
No match!
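
Note that the loop above uses `if`, so it reports at most the first match in each sequence. As far as I know, calling `find(finder, pattern)` repeatedly in a `while` loop advances to the next match, as the tutorial does. The idea of collecting every occurrence in one haystack can be sketched with the standard library alone. This is only an illustration, not SeqAn code: `findAllBegins` is a hypothetical helper and `std::string` stands in for `Dna5String`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of SeqAn): returns the begin position of
// every occurrence of `needle` in `haystack`, including overlapping ones.
std::vector<std::size_t> findAllBegins(const std::string& haystack,
                                       const std::string& needle)
{
    assert(!needle.empty());  // an empty needle would match at every position
    std::vector<std::size_t> begins;
    std::size_t pos = haystack.find(needle);
    while (pos != std::string::npos)
    {
        begins.push_back(pos);
        // Restart one character further on so overlapping matches are kept.
        pos = haystack.find(needle, pos + 1);
    }
    return begins;
}
```

Each returned position `b` corresponds to the half-open interval `[b, b + needle.size())`, the same convention as `beginPosition` and `endPosition` in the output above.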

Edit: I hope this helps. This version searches the whole StringSet through an index:

....
#include <seqan/index.h>
....

Pattern<Dna5String> pattern(Dna5String("GAATTC"));
Index< StringSet<Dna5String > > myIndex(seqs);
Finder< Index<StringSet<Dna5String > > > finder(myIndex);
while (find(finder, pattern)){
    std::cout << '[' << beginPosition(finder) << ',' << endPosition(finder)
              << ")\t" << infix(finder) << std::endl;
}   
....

You get:

[< 0 , 247 >,< 0 , 253 >)   GAATTC
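
Here each position is a pair: the first number is the index of the sequence within the StringSet and the second is the offset inside that sequence, so the hit above lies in sequence 0 at `[247, 253)`. The same bookkeeping can be sketched for a plain online scan over a set of strings. This is a standard-library illustration under my own assumptions, not SeqAn's implementation: `SetHit`, `findInSet` and `formatHit` are hypothetical names, and the hits come back in sequence order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one hit: sequence number plus half-open [begin, end).
struct SetHit
{
    std::size_t seqNo;
    std::size_t begin;
    std::size_t end;
};

// Scans every sequence in turn and records all occurrences of `needle`.
std::vector<SetHit> findInSet(const std::vector<std::string>& seqs,
                              const std::string& needle)
{
    assert(!needle.empty());  // an empty needle would match at every position
    std::vector<SetHit> hits;
    for (std::size_t i = 0; i < seqs.size(); ++i)
    {
        std::size_t pos = seqs[i].find(needle);
        while (pos != std::string::npos)
        {
            hits.push_back(SetHit{i, pos, pos + needle.size()});
            pos = seqs[i].find(needle, pos + 1);  // keep overlapping matches
        }
    }
    return hits;
}

// Formats a hit in the same layout as the output shown above.
std::string formatHit(const SetHit& hit)
{
    return "[< " + std::to_string(hit.seqNo) + " , " + std::to_string(hit.begin) +
           " >,< " + std::to_string(hit.seqNo) + " , " + std::to_string(hit.end) + " >)";
}
```

This still iterates over the set by hand, but it hides the loop behind a single call and returns set-wide positions.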