So, let's say I have a 32-character string:
GCAAAGCTTGGCACACGTCAAGAGTTGACTTT
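My goal is to count all occurrences of specific substrings, for example 'AA', 'ATT', 'CGG' and so on. Overlapping occurrences count: characters 3 through 5 above contain 2 occurrences of 'AA'. There are 8 of these substrings in total, 6 of them 3 characters long and 2 of them 2 characters long, and I want a count for all 8.
What is the most efficient way to do this in Java? My ideas fall along a few lines:
Implementing 3 is trivial, but 1 might bring some extra trouble and would not make for very clean code.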
Answer 0 (score: 0)
I think this should answer your question.
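The naive approach (checking for the substring at each possible index) runs in O(nk), where n is the length of the string and k is the length of the substring. It could be implemented with a for loop and something like haystack.substring(i).startsWith(needle).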
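A minimal sketch of that naive loop, for illustration only. The class name NaiveSubstringCount and the method count are hypothetical, and it uses String.startsWith(prefix, offset) instead of substring(i).startsWith(needle) so that no temporary strings are created. Treating an empty needle as zero matches is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the names are illustrative, not from the answer.
class NaiveSubstringCount {

    // Counts occurrences of needle in haystack, overlapping ones included.
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty pattern counts as "no matches"
        }
        int count = 0;
        // Try every start index at which the needle could still fit: O(nk) overall.
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            // startsWith(prefix, offset) compares in place, without allocating a substring.
            if (haystack.startsWith(needle, i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String haystack = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        String[] needles = { "AA", "ATT", "CGG" };
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String needle : needles) {
            counts.put(needle, count(haystack, needle));
        }
        System.out.println(counts); // prints {AA=3, ATT=0, CGG=0}
    }
}
```

With only 8 short patterns and a 32-character input, this is likely fast enough in practice; the cost only becomes noticeable for long patterns or long inputs.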
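More efficient algorithms exist, though. You might want to have a look at the Knuth-Morris-Pratt algorithm or the Aho-Corasick algorithm. In contrast to the naive approach, both of these algorithms also behave well on input such as "look for a substring of 100 'X's in a string of 10000 'X's".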
Taken from stackoverflow.com/questions/4121875/count-of-substrings-within-string
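For a single pattern, the Knuth-Morris-Pratt idea mentioned in the quote can be sketched roughly as below. This is an illustration under assumptions, not code from the linked answer: the class name KmpCount and its methods are hypothetical. It counts overlapping matches in O(n + k) by never moving backwards in the haystack.

```java
import java.util.Arrays;

// Hypothetical class; a sketch of KMP-based counting, not the linked answer's code.
class KmpCount {

    // failure[i] = length of the longest proper prefix of needle[0..i]
    // that is also a suffix of needle[0..i].
    static int[] buildFailure(String needle) {
        int[] failure = new int[needle.length()];
        int k = 0;
        for (int i = 1; i < needle.length(); i++) {
            while (k > 0 && needle.charAt(i) != needle.charAt(k)) {
                k = failure[k - 1];
            }
            if (needle.charAt(i) == needle.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    // Counts occurrences of needle in haystack, overlapping ones included.
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // assumption: an empty pattern counts as "no matches"
        }
        int[] failure = buildFailure(needle);
        int count = 0;
        int k = 0; // number of needle characters currently matched
        for (int i = 0; i < haystack.length(); i++) {
            char c = haystack.charAt(i);
            while (k > 0 && c != needle.charAt(k)) {
                k = failure[k - 1]; // shrink the partial match instead of rescanning
            }
            if (c == needle.charAt(k)) {
                k++;
            }
            if (k == needle.length()) {
                count++;
                k = failure[k - 1]; // fall back so overlapping matches are still found
            }
        }
        return count;
    }

    private static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // The worst case for the naive approach: 100 'X's inside 10000 'X's.
        System.out.println(count(repeat('X', 10000), repeat('X', 100))); // prints 9901
    }
}
```

KMP handles one pattern per pass, so counting 8 patterns means 8 passes over the input; Aho-Corasick is the multi-pattern generalisation.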
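Answer 1 (score: 0)
One way is to basically code up an NFA (http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton) and run your input through it.
Here is my attempt at coding the NFA. You may want to convert it to a DFA before running it, so that you don't have to manage a bunch of branches. With the branches it is basically as slow as O(nk), whereas it would be O(n) if converted to a DFA.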
import java.util.*;

public class Test
{
    public static void main (String[] args)
    {
        new Test();
    }

    private static final String input = "TAAATGGAGGTAATAGAGGAGGTGTAT";
    private static final String[] substrings = new String[] { "AA", "AG", "GG", "GAG", "TA" };
    private static final int[] occurrences = new int[substrings.length];

    public Test()
    {
        ArrayList<Branch> branches = new ArrayList<Branch>();
        // For each character, read it, create branches for each substring, and pass the current character
        // to each active branch
        for (int i = 0; i < input.length(); i++)
        {
            char c = input.charAt(i);
            // Make a new branch, one for each substring that we are searching for
            for (int j = 0; j < substrings.length; j++)
                branches.add(new Branch(substrings[j], j, branches));
            // Pass the current input character to each branch that is still alive
            // Iterate in reverse order because the nextCharacter method may
            // cause the branch to be removed from the ArrayList
            for (int j = branches.size()-1; j >= 0; j--)
                branches.get(j).nextCharacter(c);
        }
        for (int i = 0; i < occurrences.length; i++)
            System.out.println(substrings[i]+": "+occurrences[i]);
    }

    private static class Branch
    {
        private String searchFor;
        private int position, index;
        private ArrayList<Branch> parent;

        public Branch(String searchFor, int searchForIndex, ArrayList<Branch> parent)
        {
            this.parent = parent;
            this.searchFor = searchFor;
            this.position = 0;
            this.index = searchForIndex;
        }

        public void nextCharacter(char c)
        {
            // If the current character matches the ith character of the string we are searching for,
            // Then this branch will stay alive
            if (c == searchFor.charAt(position))
                position++;
            // Otherwise the substring didn't match, so this branch dies
            else
                suicide();
            // Reached the end of the substring, so the substring was found.
            if (position == searchFor.length())
            {
                occurrences[index] += 1;
                suicide();
            }
        }

        private void suicide()
        {
            parent.remove(this);
        }
    }
}
The output of this example is:
AA: 3
AG: 4
GG: 4
GAG: 3
TA: 4
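As a rough sketch of the DFA route mentioned in this answer, the following builds an Aho-Corasick-style automaton whose transition table is filled in completely, so each input character costs one table lookup and there are no branches to manage. The class name DnaAutomaton, its fields, and the restriction to the letters A, C, G and T are assumptions made for illustration; they are not part of the answer's code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical class; assumes the input alphabet is exactly A, C, G, T.
class DnaAutomaton {
    private static final String ALPHABET = "ACGT";

    private final int[][] next;             // next[state][symbol] = following state
    private final List<List<Integer>> hits; // pattern indices recognised on entering a state
    private final int patternCount;

    DnaAutomaton(String[] patterns) {
        patternCount = patterns.length;
        int maxStates = 1;
        for (String p : patterns) maxStates += p.length();
        next = new int[maxStates][ALPHABET.length()];
        for (int[] row : next) Arrays.fill(row, -1);
        hits = new ArrayList<>();
        hits.add(new ArrayList<>());
        int states = 1;

        // Phase 1: build a trie of all patterns.
        for (int p = 0; p < patterns.length; p++) {
            if (patterns[p].isEmpty()) throw new IllegalArgumentException("empty pattern");
            int state = 0;
            for (char c : patterns[p].toCharArray()) {
                int symbol = ALPHABET.indexOf(c);
                if (symbol < 0) throw new IllegalArgumentException("unexpected character: " + c);
                if (next[state][symbol] < 0) {
                    next[state][symbol] = states++;
                    hits.add(new ArrayList<>());
                }
                state = next[state][symbol];
            }
            hits.get(state).add(p);
        }

        // Phase 2: breadth-first pass that turns the trie into a full DFA.
        int[] fail = new int[states]; // longest proper suffix that is also a trie path
        Queue<Integer> queue = new ArrayDeque<>();
        for (int s = 0; s < ALPHABET.length(); s++) {
            if (next[0][s] < 0) next[0][s] = 0;
            else queue.add(next[0][s]); // depth-1 states fall back to the root
        }
        while (!queue.isEmpty()) {
            int state = queue.remove();
            // Patterns ending at the fallback state also end here (e.g. "AG" inside "GAG").
            hits.get(state).addAll(hits.get(fail[state]));
            for (int s = 0; s < ALPHABET.length(); s++) {
                int child = next[state][s];
                if (child < 0) {
                    next[state][s] = next[fail[state]][s]; // borrow the fallback's transition
                } else {
                    fail[child] = next[fail[state]][s];
                    queue.add(child);
                }
            }
        }
    }

    // Returns one count per pattern, overlapping occurrences included.
    int[] count(String input) {
        int[] counts = new int[patternCount];
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            int symbol = ALPHABET.indexOf(input.charAt(i));
            if (symbol < 0) { state = 0; continue; } // unknown letter: no pattern can span it
            state = next[state][symbol];
            for (int p : hits.get(state)) counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] substrings = { "AA", "AG", "GG", "GAG", "TA" };
        int[] counts = new DnaAutomaton(substrings).count("TAAATGGAGGTAATAGAGGAGGTGTAT");
        for (int i = 0; i < substrings.length; i++)
            System.out.println(substrings[i] + ": " + counts[i]);
    }
}
```

After construction, scanning takes time proportional to the input length plus the number of matches reported, regardless of how many patterns there are.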
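Answer 2 (score: 0)
Do you want to find all possible substrings that are more than 1 character long? In that case, one way is to use HashMaps.
This example reports only substrings that occur more than once. It outputs: {AA=3, TT=4, AC=3, CTT=2, CAA=2, GCA=2, CAC=2, AG=3, TTG=2, AAG=2, GT=2, CT=2, TG=2, GA=2, GC=3, CA=4}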
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Test {
    public static void main(String[] args) {
        String str = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        HashMap<String, Integer> map = countMatches(str);
        System.out.println(map);
    }

    // Starts the search with every index as a match position for 1-letter substrings.
    private static HashMap<String, List<Integer>> findOneLetterMatches(String str) {
        ArrayList<Integer> list = new ArrayList<>();
        for(int i = 0; i < str.length(); i++) list.add(i);
        return extendMatches(str, list, 1);
    }

    // Groups the given start indices by the substring of targetLength found at each one.
    private static HashMap<String, List<Integer>> extendMatches(String str, List<Integer> indices, int targetLength) {
        HashMap<String, List<Integer>> map = new HashMap<>();
        for(int index: indices) {
            if(index+targetLength <= str.length()) {
                String s = str.substring(index, index + targetLength);
                List<Integer> list = map.get(s);
                if(list == null) {
                    list = new ArrayList<>();
                    map.put(s, list);
                }
                list.add(index);
            }
        }
        return map;
    }

    // Keeps only the substrings that occur at more than one position.
    private static void addIfListLongerThanOne(HashMap<String, List<Integer>> source,
            HashMap<String, List<Integer>> target) {
        for(Map.Entry<String, List<Integer>> e: source.entrySet()) {
            String s = e.getKey();
            List<Integer> l = e.getValue();
            if(l.size() > 1) target.put(s, l);
        }
    }

    // Extends every known repeated substring by one character.
    private static HashMap<String, List<Integer>> extendAllMatches(String str, HashMap<String, List<Integer>> map, int targetLength) {
        HashMap<String, List<Integer>> result = new HashMap<>();
        for(List<Integer> list: map.values()) {
            HashMap<String, List<Integer>> m = extendMatches(str, list, targetLength);
            addIfListLongerThanOne(m, result);
        }
        return result;
    }

    // Grows the substring length until no substring repeats any more.
    private static HashMap<String, Integer> countMatches(String str) {
        HashMap<String, Integer> result = new HashMap<>();
        HashMap<String, List<Integer>> matches = findOneLetterMatches(str);
        for(int targetLength = 2; !matches.isEmpty(); targetLength++) {
            HashMap<String, List<Integer>> m = extendAllMatches(str, matches, targetLength);
            for(Map.Entry<String, List<Integer>> e: m.entrySet()) {
                String s = e.getKey();
                List<Integer> l = e.getValue();
                result.put(s, l.size());
            }
            matches = m;
        }
        return result;
    }
}
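If only the question's case matters (target substrings of length 2 and 3), the same HashMap idea can be reduced to a single windowed count followed by lookups. The sketch below is a simplification under that assumption, not a replacement for the code above; the class name WindowCount and the method countWindows are hypothetical, and it requires Java 8 for Map.merge and Map.getOrDefault.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class; assumes all target substrings have a length between minLen and maxLen.
class WindowCount {

    // Counts every substring whose length lies in minLen..maxLen, overlaps included.
    static Map<String, Integer> countWindows(String str, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int len = minLen; len <= maxLen; len++) {
            for (int i = 0; i + len <= str.length(); i++) {
                // merge() inserts 1 for a new key, or adds 1 to the existing count.
                counts.merge(str.substring(i, i + len), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String str = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        String[] targets = { "AA", "ATT", "CGG" }; // three of the question's example substrings
        Map<String, Integer> all = countWindows(str, 2, 3);
        Map<String, Integer> wanted = new LinkedHashMap<>();
        for (String target : targets) {
            wanted.put(target, all.getOrDefault(target, 0)); // absent substrings count as 0
        }
        System.out.println(wanted); // prints {AA=3, ATT=0, CGG=0}
    }
}
```

This makes one pass per window length and answers any number of lookups afterwards, at the cost of storing counts for substrings nobody asked about.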