Best way to scan for letter combinations

Time: 2014-04-21 03:07:45

Tags: java

So, let's say I have a 32-character string:

GCAAAGCTTGGCACACGTCAAGAGTTGACTTT

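My goal is to count all occurrences of specific substrings, e.g. 'AA', 'ATT', 'CGG' and so on. For instance, characters 3 through 5 above (AAA) contain 2 occurrences of 'AA'. There are 8 of these substrings in total, 6 of them 3 characters long and 2 of them 2 characters long, and I want a count for all 8.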

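What is the most efficient way to do this in Java? I have a few ideas:

  1. Scan character by character, checking for and tallying each substring as I go. This seems intensive and inefficient.
  2. Find some existing function that does the job (not sure how efficient it would be; String.contains returns a boolean, not a count).
  3. Scan the string multiple times, checking for a different substring on each pass.

Implementing 3 is trivial, but 1 might bring some extra trouble and would not be very clean code.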

3 answers:

Answer 0 (score: 0)

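I think this should answer your question.

The naive approach (checking for the substring at each possible index) runs in O(nk), where n is the length of the string and k is the length of the substring. This could be implemented with a for loop and something like haystack.substring(i).startsWith(needle).

More efficient algorithms exist, though. You might want to have a look at the Knuth-Morris-Pratt algorithm or the Aho-Corasick algorithm. Unlike the naive approach, both of these algorithms also behave well on input like "look for a substring of 100 'X's in a string of 10000 'X's".

Taken from stackoverflow.com/questions/4121875/count-of-substrings-within-string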
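
For reference, the naive approach described above could look roughly like this in Java. This is only a sketch: the class name NaiveCount and the method count are made up for illustration, and it uses String.startsWith(prefix, offset) rather than substring(i).startsWith(needle) so that no intermediate strings are created. Overlapping occurrences are counted, which matches the question's example of AAA containing 'AA' twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of the quoted answer.
class NaiveCount {
    // Counts (possibly overlapping) occurrences of needle by testing every start index.
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // treat the empty needle as "no matches" in this sketch
        }
        int count = 0;
        for (int i = 0; i + needle.length() <= haystack.length(); i++) {
            // startsWith(prefix, offset) compares in place, without copying
            if (haystack.startsWith(needle, i)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        String[] needles = { "AA", "CTT", "TT" };
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String needle : needles) {
            counts.put(needle, count(input, needle)); // one full pass per needle
        }
        System.out.println(counts);
    }
}
```

With 8 short needles and a 32-character string, one pass per needle like this is probably fast enough; the more elaborate algorithms mainly pay off for long inputs or many patterns.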

Answer 1 (score: 0)

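One way is to basically code up an NFA (http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton) and run your input through it.

Here is my attempt at coding the NFA. You may want to convert it to a DFA before running it, so that you don't have to manage a bunch of branches. With the branches it is basically as slow as O(nk), whereas converted to a DFA it would be O(n).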

import java.util.*;

public class Test
{
    public static void main (String[] args)
    {
        new Test();
    }

    private static final String input = "TAAATGGAGGTAATAGAGGAGGTGTAT";
    private static final String[] substrings = new String[] { "AA", "AG", "GG", "GAG", "TA" };
    private static final int[] occurrences = new int[substrings.length];

    public Test()
    {
        ArrayList<Branch> branches = new ArrayList<Branch>();

        //  For each character, read it, create branches for each substring, and pass the current character
        //  to each active branch
        for (int i = 0; i < input.length(); i++)
        {
            char c = input.charAt(i);

            //  Make a new branch, one for each substring that we are searching for
            for (int j = 0; j < substrings.length; j++)
                branches.add(new Branch(substrings[j], j, branches));

            //  Pass the current input character to each branch that is still alive
            //  Iterate in reverse order because the nextCharacter method may
            //  cause the branch to be removed from the ArrayList
            for (int j = branches.size()-1; j >= 0; j--)
                branches.get(j).nextCharacter(c);
        }

        for (int i = 0; i < occurrences.length; i++)
            System.out.println(substrings[i]+": "+occurrences[i]);
    }

    private static class Branch
    {
        private String searchFor;
        private int position, index;
        private ArrayList<Branch> parent;

        public Branch(String searchFor, int searchForIndex, ArrayList<Branch> parent)
        {
            this.parent = parent;
            this.searchFor = searchFor;
            this.position = 0;
            this.index = searchForIndex;
        }

        public void nextCharacter(char c)
        {
            //  If the current character matches the ith character of the string we are searching for,
            //  Then this branch will stay alive
            if (c == searchFor.charAt(position))
                position++;
            //  Otherwise the substring didn't match, so this branch dies
            else
                suicide();

            //  Reached the end of the substring, so the substring was found.
            if (position == searchFor.length())
            {
                occurrences[index] += 1;
                suicide();
            }
        }

        private void suicide()
        {
            parent.remove(this);
        }
    }
}

This example prints AA: 3, AG: 4, GG: 4, GAG: 3, TA: 4 (one entry per line).
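
The DFA conversion mentioned above is not shown in this answer. One well-known way to get a single-pass automaton for a set of patterns is Aho-Corasick, so here is a rough sketch of that technique rather than a conversion of the code above. The class name AhoCorasickCount, the fixed "ACGT" alphabet, and the choice to reject empty patterns are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of Aho-Corasick counting; names are illustrative only.
class AhoCorasickCount {
    private static final String ALPHABET = "ACGT"; // assumption: DNA letters only

    static int[] count(String text, String[] patterns) {
        List<int[]> next = new ArrayList<>();        // next.get(s)[c]: transition, -1 = missing
        List<List<Integer>> out = new ArrayList<>(); // pattern indices that end in each state
        addState(next, out);                         // state 0 is the root

        // 1. Build a trie of all patterns.
        for (int p = 0; p < patterns.length; p++) {
            if (patterns[p].isEmpty()) throw new IllegalArgumentException("empty pattern");
            int s = 0;
            for (char ch : patterns[p].toCharArray()) {
                int c = ALPHABET.indexOf(ch);
                if (c < 0) throw new IllegalArgumentException("unexpected letter: " + ch);
                if (next.get(s)[c] < 0) {
                    next.get(s)[c] = next.size();
                    addState(next, out);
                }
                s = next.get(s)[c];
            }
            out.get(s).add(p);
        }

        // 2. Breadth-first pass: compute failure links and fill in the missing
        //    transitions, which turns the trie into a complete DFA.
        int[] fail = new int[next.size()];
        Queue<Integer> queue = new ArrayDeque<>();
        for (int c = 0; c < ALPHABET.length(); c++) {
            int t = next.get(0)[c];
            if (t < 0) next.get(0)[c] = 0;
            else queue.add(t); // depth-1 states fail back to the root
        }
        while (!queue.isEmpty()) {
            int s = queue.remove();
            out.get(s).addAll(out.get(fail[s])); // inherit patterns that end at a suffix
            for (int c = 0; c < ALPHABET.length(); c++) {
                int t = next.get(s)[c];
                if (t < 0) {
                    next.get(s)[c] = next.get(fail[s])[c];
                } else {
                    fail[t] = next.get(fail[s])[c];
                    queue.add(t);
                }
            }
        }

        // 3. One pass over the text: one transition per character.
        int[] counts = new int[patterns.length];
        int s = 0;
        for (int i = 0; i < text.length(); i++) {
            int c = ALPHABET.indexOf(text.charAt(i));
            if (c < 0) { s = 0; continue; } // a foreign letter cannot be inside any match
            s = next.get(s)[c];
            for (int p : out.get(s)) counts[p]++;
        }
        return counts;
    }

    private static void addState(List<int[]> next, List<List<Integer>> out) {
        int[] row = new int[ALPHABET.length()];
        Arrays.fill(row, -1);
        next.add(row);
        out.add(new ArrayList<>());
    }

    public static void main(String[] args) {
        String[] substrings = { "AA", "AG", "GG", "GAG", "TA" };
        int[] counts = count("TAAATGGAGGTAATAGAGGAGGTGTAT", substrings);
        for (int i = 0; i < substrings.length; i++) {
            System.out.println(substrings[i] + ": " + counts[i]);
        }
    }
}
```

On the same input and substrings as the NFA code, this should report the same counts. After the automaton is built, the scan does one table lookup per character plus work proportional to the number of matches reported.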

Answer 2 (score: 0)

Do you want to find all possible substrings longer than 1 character? In that case, one way is to use HashMaps.

This example outputs: {AA=3, TT=4, AC=3, CTT=2, CAA=2, GCA=2, CAC=2, AG=3, TTG=2, AAG=2, GT=2, CT=2, TG=2, GA=2, GC=3, CA=4}

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map; // needed for Map.Entry below

public class Test {
    public static void main(String[] args) {
        String str = "GCAAAGCTTGGCACACGTCAAGAGTTGACTTT";
        HashMap<String, Integer> map = countMatches(str);
        System.out.println(map);
    }

    private static HashMap<String, List<Integer>> findOneLetterMatches(String str) {
        ArrayList<Integer> list = new ArrayList<>();
        for(int i = 0; i < str.length(); i++) list.add(i);
        return extendMatches(str, list, 1);
    }

    private static HashMap<String, List<Integer>> extendMatches(String str, List<Integer> indices, int targetLength) {
        HashMap<String, List<Integer>> map = new HashMap<>();
        for(int index: indices) {
            if(index+targetLength <= str.length()) {
                String s = str.substring(index, index + targetLength);
                List<Integer> list = map.get(s);
                if(list == null) {
                    list = new ArrayList<>();
                    map.put(s, list);
                }
                list.add(index);
            }
        }
        return map;
    }

    private static void addIfListLongerThanOne(HashMap<String, List<Integer>> source,
                                               HashMap<String, List<Integer>> target) {
        for(Map.Entry<String, List<Integer>> e: source.entrySet()) {
            String s = e.getKey();
            List<Integer> l = e.getValue();
            if(l.size() > 1) target.put(s, l);
        }
    }

    private static HashMap<String, List<Integer>> extendAllMatches(String str, HashMap<String, List<Integer>> map, int targetLength) {
        HashMap<String, List<Integer>> result = new HashMap<>();
        for(List<Integer> list: map.values()) {
            HashMap<String, List<Integer>> m = extendMatches(str, list, targetLength);
            addIfListLongerThanOne(m, result);
        }
        return result;
    }

    private static HashMap<String, Integer> countMatches(String str) {
        HashMap<String, Integer> result = new HashMap<>();
        HashMap<String, List<Integer>> matches = findOneLetterMatches(str);
        for(int targetLength = 2; !matches.isEmpty(); targetLength++) {
            HashMap<String, List<Integer>> m = extendAllMatches(str, matches, targetLength);
            for(Map.Entry<String, List<Integer>> e: m.entrySet()) {
                String s = e.getKey();
                List<Integer> l = e.getValue();
                result.put(s, l.size());
            }
            matches = m;
        }
        return result;
    }
}