Getting the most frequent k-mers of length k in a string

Time: 2013-11-18 01:19:17

Tags: c#

Suppose I have a string like "01xTTx10TxT1x10Tx0Tx10Tx0x0x1T" and I want to generate all of its substrings of length 4 (its 4-mers):

01xT Tx10 TxT1 x10T x0Tx 10Tx 0x0x 1T
1xTT x10T xT1x 10Tx 0Tx1 0Tx0 x0x1 T
xTTx 10Tx T1x1 0Tx0 Tx10 Tx0x 0x1T
...

Then, once I have all of them, I want to know which ones are the most frequent.

To do this, I plan to create a dictionary and increment a count for each appearance, like this:

string original = "01xTTx10TxT1x10Tx0Tx10Tx0x0x1T";
int size = 4;
string[] arr = new string[0];  // placeholder: how do I generate kmers(original, 4)?

Dictionary<string, int> dictionary = new Dictionary<string, int>();
foreach (string word in arr) //loop over all kmers
{
   if (dictionary.ContainsKey(word)) //if it's in the dictionary
       dictionary[word] = dictionary[word] + 1; //Increment the count
   else
       dictionary[word] = 1; //put it in the dictionary with a count 1
}
foreach (KeyValuePair<string, int> pair in dictionary) //loop through the dictionary
     System.Console.Write(string.Format("{0} {1} \n", pair.Key, pair.Value));

But I'm not sure how to efficiently generate all the k-mers of size 4 from the string.

So in this example, the most frequent k-mers I should get are:

10Tx 
x10T

These are the most common 4-character words in the original string, as these two splits of it show:
01xTTx 10Tx T1x 10Tx 0Tx 10Tx 0x0x1T
01xTT x10T xT1 x10T x0T x10T x0x0x1T

Or is there a better way to get this result?

1 Answer:

Answer 0 (score: 1)

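Write a loop to get all the k-mers, then use a regular expression to find all of each k-mer's matches in the input string. The k-mer with the most matches wins. Something like this: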

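using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var term = "01xTTx10TxT1x10Tx0Tx10Tx0x0x1T";
var dict = new Dictionary<string, int>();
// Use <= so the final k-mer (starting at term.Length - 4) is not skipped
for (var i = 0; i <= term.Length - 4; i++)
{
    var kmer = term.Substring(i, 4);
    if (!dict.ContainsKey(kmer))
        // Escape the k-mer so any regex metacharacters match literally.
        // Note: Regex.Matches counts non-overlapping matches only.
        dict[kmer] = Regex.Matches(term, Regex.Escape(kmer)).Count;
}

var maxOccurring = dict.Max(m => m.Value);                       // highest count
var maxOccurringTerm = dict.Where(l => l.Value == maxOccurring); // all k-mers with that count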
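
As an alternative to running one regex search per k-mer, a single sliding-window pass can fill the dictionary directly, much as the question's counting loop intended. The following is only a minimal sketch under stated assumptions: the names `KmerCounter`, `CountKmers`, and `MostFrequent` are hypothetical and do not come from the original post, and the `out int` syntax assumes C# 7 or later. Unlike the regex count, this pass also counts overlapping occurrences.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not from the original post): counts k-mers in one pass.
static class KmerCounter
{
    // Slides a window of length k over the text and tallies each substring.
    public static Dictionary<string, int> CountKmers(string text, int k)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (k <= 0) throw new ArgumentOutOfRangeException(nameof(k));

        var counts = new Dictionary<string, int>();
        // i <= text.Length - k so that the last full window is included.
        for (int i = 0; i <= text.Length - k; i++)
        {
            string kmer = text.Substring(i, k);
            counts.TryGetValue(kmer, out int current); // current is 0 when the key is absent
            counts[kmer] = current + 1;
        }
        return counts;
    }

    // Returns every k-mer tied for the highest count, in ordinal order.
    public static List<string> MostFrequent(Dictionary<string, int> counts)
    {
        if (counts.Count == 0) return new List<string>();
        int max = counts.Values.Max();
        return counts.Where(p => p.Value == max)
                     .Select(p => p.Key)
                     .OrderBy(s => s, StringComparer.Ordinal)
                     .ToList();
    }

    static void Main()
    {
        var counts = CountKmers("01xTTx10TxT1x10Tx0Tx10Tx0x0x1T", 4);
        foreach (string kmer in MostFrequent(counts))
            Console.WriteLine("{0} {1}", kmer, counts[kmer]);
    }
}
```

For the string in the question this should print `10Tx 3` and `x10T 3`, matching the result the question expects. If the text is shorter than k, the loop never runs and the dictionary stays empty.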