Question

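I want to build a collection of Strings (any data structure, however complex, such as a collection) that I can use efficiently as "examples" to know where a given string can be split.
In the example, I have this String collection:

abaco code, exchange.
bold word can be bold.
tree folder and leaf of tree.

And a given string: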

omecodeexchangeuthercanbetreeofword

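I would like the algorithm to produce something like:

ome code exchange uther can be tree of word

The parts "ome" and "uther" cannot be split, so they are left as they are (it would be nice if these parts were marked as NOT-RECOGNIZED). I looked into the KMP algorithm, but it is too far from what I need, and I want to organize the collection so that lookups are time-efficient (less than linear in the size of the collection).

I forgot to mention:

The strings to split are natural-language words mixed with slang words, with no spaces.
I have already tried a dynamic algorithm based on a weighted word dictionary, but it produced too many wrong splits among candidates of equal weight ("wrong" meaning wrong as natural language).
I need the best possible result for this split, using the word sequences in the String collection as "good examples".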

Answer 1

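Dynamic programming can help here. Let f(i) be the minimum number of characters among the first i characters of the word that are not covered by a dictionary word:

f(0) = 0
f(i) = min { f(j) + (dictionary.contains(word.substring(j,i)) ? 0 : i-j)  for each j=0,...,i-1 }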

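The idea is to do an exhaustive search with the recursive function above while minimizing the number of letters that do not fit any dictionary word. With DP you avoid recomputing the same subproblems and still get the correct answer efficiently.

The actual partition can be recovered by remembering which j was chosen at each step and then retracing those choices from the last step back to the first.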
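Before the full iterative version, the recurrence can also be read directly as a memoized recursion. The following is only a minimal sketch under assumptions: the class name `UnmatchedCost` and the helpers `f` and `minUnmatched` are hypothetical names chosen for illustration, and the small dictionary in `main` is made up. It computes the cost only, not the split itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UnmatchedCost {
    // Hypothetical helper: f(i) from the recurrence, cached in memo (-1 = not computed yet).
    static int f(String word, Set<String> dict, int i, int[] memo) {
        if (i == 0) {
            return 0; // base case: the empty prefix has no unmatched characters
        }
        if (memo[i] >= 0) {
            return memo[i]; // already computed, reuse it
        }
        int best = Integer.MAX_VALUE;
        for (int j = 0; j < i; j++) {
            // The last piece word[j, i) costs 0 if it is a dictionary word, else its length.
            int penalty = dict.contains(word.substring(j, i)) ? 0 : i - j;
            best = Math.min(best, f(word, dict, j, memo) + penalty);
        }
        memo[i] = best;
        return best;
    }

    // Hypothetical entry point: minimum number of unmatched characters in the whole word.
    static int minUnmatched(String word, Set<String> dict) {
        int[] memo = new int[word.length() + 1];
        Arrays.fill(memo, -1);
        return f(word, dict, word.length(), memo);
    }

    public static void main(String[] args) {
        // Illustrative dictionary, not the one from the question.
        Set<String> dict = new HashSet<>(Arrays.asList("code", "can", "be"));
        System.out.println(minUnmatched("xcodecanbe", dict)); // prints 1 (only "x" is unmatched)
    }
}
```

Each f(i) is computed once and tries every j < i, so there are O(n^2) dictionary lookups for a word of length n. The recursion depth grows with n, which is one reason to prefer an iterative table for long inputs.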

Java code:

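    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class DictionarySplit {
        public static void main(String[] args) {
            String word = "omecodeexchangeuthercanbetreeofword";
            Set<String> set = new HashSet<>(Arrays.asList("abaco", "code", "exchange", "bold", "word", "can", "be", "tree", "folder", "and", "of", "leaf"));
            int n = word.length() + 1;
            int[] f = new int[n];        // f[i]: minimum number of unmatched chars in word[0, i)
            int[] jChoices = new int[n]; // jChoices[i]: start of the last piece in the best split of word[0, i)
            f[0] = 0;
            for (int i = 1; i < n; i++) {
                int best = Integer.MAX_VALUE;
                int bestJ = -1;
                // Try every possible start j for the last piece word[j, i).
                for (int j = 0; j < i; j++) {
                    int curr = f[j] + (set.contains(word.substring(j, i)) ? 0 : (i - j));
                    if (curr < best) {
                        best = curr;
                        bestJ = j;
                    }
                }
                jChoices[i] = bestJ;
                f[i] = best;
            }
            System.out.println("unmatched chars: " + f[n - 1]);
            System.out.println("split:");
            // Walk the recorded choices backwards from the end of the word.
            int j = n - 1;
            List<String> splits = new ArrayList<>();
            while (j > 0) {
                splits.add(word.substring(jChoices[j], j));
                j = jChoices[j];
            }
            Collections.reverse(splits);
            for (String s : splits) {
                System.out.println(s + " " + (set.contains(s) ? "(match)" : "(does not match)"));
            }
        }
    }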
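The question also asks for lookups that do not depend on the size of the collection. A `HashSet` already gives that on average, but the code above still builds a substring for every (j, i) pair. One possible variation, which is not part of the answer above, is to store the dictionary in a trie and walk it from each start position, stopping as soon as no dictionary word can continue. The sketch below makes that assumption; `TrieSplitter`, `Node`, `split` and `contains` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrieSplitter {
    // Minimal trie node (illustrative, not from the original answer).
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public TrieSplitter(Iterable<String> words) {
        for (String w : words) {
            Node node = root;
            for (char c : w.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }
    }

    public boolean contains(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.next.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }

    // Splits text so that the number of characters outside dictionary words is minimal.
    public List<String> split(String text) {
        int n = text.length();
        int[] cost = new int[n + 1];               // cost[i]: min unmatched chars in text[0, i)
        int[] prev = new int[n + 1];               // prev[i]: start of the last piece ending at i
        boolean[] isWordPiece = new boolean[n + 1]; // true if that last piece is a dictionary word
        Arrays.fill(cost, Integer.MAX_VALUE);
        cost[0] = 0;
        for (int j = 0; j < n; j++) {
            // Option 1: leave character j unmatched (penalty 1).
            if (cost[j] + 1 < cost[j + 1]) {
                cost[j + 1] = cost[j] + 1;
                prev[j + 1] = j;
                isWordPiece[j + 1] = false;
            }
            // Option 2: follow the trie from j; every word end reached costs nothing extra.
            Node node = root;
            for (int i = j; i < n; i++) {
                node = node.next.get(text.charAt(i));
                if (node == null) break; // no dictionary word continues with this character
                if (node.isWord && cost[j] < cost[i + 1]) {
                    cost[i + 1] = cost[j];
                    prev[i + 1] = j;
                    isWordPiece[i + 1] = true;
                }
            }
        }
        // Backtrack, merging consecutive unmatched characters into a single piece.
        List<String> pieces = new ArrayList<>();
        int end = n;
        while (end > 0) {
            int start = prev[end];
            if (!isWordPiece[end]) {
                while (start > 0 && !isWordPiece[start]) {
                    start = prev[start];
                }
            }
            pieces.add(text.substring(start, end));
            end = start;
        }
        Collections.reverse(pieces);
        return pieces;
    }

    public static void main(String[] args) {
        TrieSplitter splitter = new TrieSplitter(Arrays.asList("abaco", "code", "exchange", "bold",
                "word", "can", "be", "tree", "folder", "and", "of", "leaf"));
        for (String piece : splitter.split("omecodeexchangeuthercanbetreeofword")) {
            System.out.println(piece + (splitter.contains(piece) ? "" : " (NOT-RECOGNIZED)"));
        }
    }
}
```

With this layout the inner loop runs for at most the length of the longest dictionary word, so the work per start position is independent of how many words the collection holds. It minimizes the same cost as the recurrence above; it does not address the natural-language ranking problem mentioned in the question.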

Answer 2

This can be done easily with a regular expression, and regular expression engines are heavily optimized for performance.

    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class RegexSplit {
        public static void main(String[] args) {
            List<String> splitWords = Arrays.asList("abaco", "code", "exchange", "bold", "word", "can", "be", "tree", "folder", "and", "of", "leaf");

            // Build one alternation out of all words. Pattern.quote keeps any regex
            // metacharacters inside a word from being interpreted as regex syntax.
            String splitRegex = splitWords.stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));

            String stringToSplit = "omecodeexchangeuthercanbetreeofword";

            Pattern pattern = Pattern.compile(splitRegex);
            Matcher matcher = pattern.matcher(stringToSplit);

            int previousMatchEnd = 0;
            while (matcher.find()) {
                int matchStart = matcher.start();
                int matchEnd = matcher.end();

                // Anything between the previous match and this one is unrecognized.
                if (matchStart != previousMatchEnd)
                    System.out.println("Not recognized: " + stringToSplit.substring(previousMatchEnd, matchStart));

                System.out.println("Match: " + stringToSplit.substring(matchStart, matchEnd));
                previousMatchEnd = matchEnd;
            }

            // Trailing text after the last match is unrecognized as well.
            if (previousMatchEnd != stringToSplit.length())
                System.out.println("Not recognized: " + stringToSplit.substring(previousMatchEnd));
        }
    }

Output:

Not recognized: ome
Match: code
Match: exchange
Not recognized: uther
Match: can
Match: be
Match: tree
Match: of
Match: word
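One caveat with the alternation approach: Java tries the alternatives in the order they are written and takes the first one that matches at a given position, not the longest. If one dictionary word is a prefix of another (say "can" and "cane", a made-up pair that is not in the question's list), the listing order decides the result. A possible mitigation, sketched below with the hypothetical names `RegexSplitSketch`, `buildPattern` and `split`, is to sort the words longest-first before joining them. Unrecognized pieces are marked with a leading "?" purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexSplitSketch {
    // Builds an alternation with longer words first, so "cane" is tried before "can".
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote) // treat each word literally
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the pieces in order; unrecognized pieces get a "?" prefix (illustrative convention).
    static List<String> split(Pattern pattern, String text) {
        List<String> pieces = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        int previousEnd = 0;
        while (matcher.find()) {
            if (matcher.start() != previousEnd) {
                pieces.add("?" + text.substring(previousEnd, matcher.start()));
            }
            pieces.add(matcher.group());
            previousEnd = matcher.end();
        }
        if (previousEnd != text.length()) {
            pieces.add("?" + text.substring(previousEnd));
        }
        return pieces;
    }

    public static void main(String[] args) {
        Pattern pattern = buildPattern(Arrays.asList("can", "cane"));
        System.out.println(split(pattern, "xcane")); // prints [?x, cane]
    }
}
```

Even with longest-first ordering this remains a greedy left-to-right scan: it commits to the first match it finds and never reconsiders, so it can miss a split with fewer unrecognized characters that the dynamic programming approach would find.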

Split a string using examples from another collection of strings
