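I want to build a collection of Strings (any data structure, however complex, such as a set) that I can use efficiently as a "sample" to find out where a given string can be split.
For example, I have this String collection:
and the given string:
and I would like the algorithm to return something like:
The parts "ome" and "uther" cannot be split, so they stay as they are (it would be nice if such parts were marked as NOT-RECOGNIZED). I tried analyzing the KMP algorithm, but it is too far from what I need. I want to organize the collection so that lookups are time-efficient (less than linear in the size of the collection).
I forgot to say: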
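Answer 0: (score: 1)
Dynamic Programming can help here. Let f(i) be the smallest number of unmatched characters in any split of the first i characters of the word:
f(0) = 0
f(i) = min { f(j) + (dictionary.contains(word.substring(j, i)) ? 0 : i - j) for each j = 0, ..., i-1 }
The idea is to search exhaustively with the recursive function above while minimizing the number of letters that do not fit any dictionary word. With DP you avoid repeated computation and get the correct answer efficiently.
The actual partition can be recovered by remembering which j was chosen at each step and then backtracking from the last step to the first.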
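As a small worked instance of the recurrence, take the word "xbe" and the one-word dictionary {"be"} (both made up purely for illustration):

```latex
\begin{align*}
f(0) &= 0 \\
f(1) &= f(0) + 1 = 1 && \text{``x'' is not a word} \\
f(2) &= \min\{\, f(0) + 2,\; f(1) + 1 \,\} = 2 \\
f(3) &= \min\{\, f(0) + 3,\; f(1) + 0,\; f(2) + 1 \,\} = 1 && \text{``be'' matches at } j = 1
\end{align*}
```

Backtracking from i = 3 follows j = 1 and then j = 0, which gives the segments "x" (unmatched) and "be" (match).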
Java code:
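import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionarySplit {
    public static void main(String[] args) {
        String word = "omecodeexchangeuthercanbetreeofword";
        Set<String> set = new HashSet<>(Arrays.asList("abaco", "code", "exchange", "bold", "word", "can", "be", "tree", "folder", "and", "of", "leaf"));
        int n = word.length() + 1;
        int[] f = new int[n];         // f[i]: fewest unmatched characters in word[0..i)
        int[] jChoices = new int[n];  // jChoices[i]: start of the last segment in the best split of word[0..i)
        f[0] = 0;
        for (int i = 1; i < n; i++) {
            int best = Integer.MAX_VALUE;
            int bestJ = -1;
            for (int j = 0; j < i; j++) {
                // word[j..i) costs nothing if it is a dictionary word, otherwise its full length
                int curr = f[j] + (set.contains(word.substring(j, i)) ? 0 : (i - j));
                if (curr < best) {
                    best = curr;
                    bestJ = j;
                }
            }
            jChoices[i] = bestJ;
            f[i] = best;
        }
        System.out.println("unmatched chars: " + f[n - 1]);
        System.out.println("split:");
        // Walk the recorded choices from the end of the word back to the start
        int j = n - 1;
        List<String> splits = new ArrayList<>();
        while (j > 0) {
            splits.add(word.substring(jChoices[j], j));
            j = jChoices[j];
        }
        Collections.reverse(splits);
        for (String s : splits) {
            System.out.println(s + " " + (set.contains(s) ? "(match)" : "(does not match)"));
        }
    }
}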
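The question also asks for lookups that do not grow with the size of the collection. One possible way to get there, sketched here as an assumption rather than as part of the answer above, is to store the words in a trie and run the same DP forwards: from each start position the trie is walked character by character, so the work per position is bounded by the longest word and not by the number of words. The names `TrieSplitter`, `Node`, `buildTrie` and `split` are hypothetical, and wrapping unmatched runs in square brackets is just an illustrative way of marking them as not recognized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrieSplitter {
    // Hypothetical helper type: one node per dictionary prefix.
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isWord;
    }

    static Node buildTrie(Iterable<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.isWord = true;
        }
        return root;
    }

    // Returns the segments in order; unmatched runs are wrapped in [ ].
    static List<String> split(String word, Node root) {
        int n = word.length();
        int[] f = new int[n + 1];               // f[i]: fewest unmatched chars in word[0..i)
        int[] prev = new int[n + 1];            // start of the last segment ending at i
        boolean[] viaWord = new boolean[n + 1]; // true if that last segment is a dictionary word
        Arrays.fill(f, Integer.MAX_VALUE);
        f[0] = 0;
        for (int j = 0; j < n; j++) {
            // Option 1: leave word[j] unmatched.
            if (f[j] + 1 < f[j + 1]) {
                f[j + 1] = f[j] + 1;
                prev[j + 1] = j;
                viaWord[j + 1] = false;
            }
            // Option 2: follow the trie from j; every word end reached is a free segment.
            Node cur = root;
            for (int i = j; i < n; i++) {
                cur = cur.next.get(word.charAt(i));
                if (cur == null) break;
                if (cur.isWord && f[j] < f[i + 1]) {
                    f[i + 1] = f[j];
                    prev[i + 1] = j;
                    viaWord[i + 1] = true;
                }
            }
        }
        // Backtrack from the end, merging consecutive unmatched characters into one run.
        List<String> out = new ArrayList<>();
        StringBuilder unmatched = new StringBuilder();
        for (int i = n; i > 0; i = prev[i]) {
            String seg = word.substring(prev[i], i);
            if (viaWord[i]) {
                flush(unmatched, out);
                out.add(seg);
            } else {
                unmatched.insert(0, seg);
            }
        }
        flush(unmatched, out);
        Collections.reverse(out);
        return out;
    }

    private static void flush(StringBuilder unmatched, List<String> out) {
        if (unmatched.length() > 0) {
            out.add("[" + unmatched + "]");
            unmatched.setLength(0);
        }
    }

    public static void main(String[] args) {
        Node root = buildTrie(Arrays.asList("abaco", "code", "exchange", "bold", "word",
                "can", "be", "tree", "folder", "and", "of", "leaf"));
        System.out.println(split("omecodeexchangeuthercanbetreeofword", root));
    }
}
```

With the dictionary and string from the answer above, `main` should print `[[ome], code, exchange, [uther], can, be, tree, of, word]`. The trie is built once and can be reused for any number of input strings.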
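Answer 1: (score: 0)
This can be done easily with a regular expression that is an alternation of all the dictionary words. The regex engine is heavily optimized, but note that it takes the first alternative that matches at each position, so the result is a greedy left-to-right split.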
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSplit {
    public static void main(String[] args) {
        List<String> splitWords = Arrays.asList("abaco", "code", "exchange", "bold", "word", "can", "be", "tree", "folder", "and", "of", "leaf");
        // Build word1|word2|...; Pattern.quote keeps any regex metacharacters in a word literal
        StringBuilder splitRegex = new StringBuilder();
        for (String splitWord : splitWords) {
            if (splitRegex.length() > 0)
                splitRegex.append("|");
            splitRegex.append(Pattern.quote(splitWord));
        }
        String stringToSplit = "omecodeexchangeuthercanbetreeofword";
        Pattern pattern = Pattern.compile(splitRegex.toString());
        Matcher matcher = pattern.matcher(stringToSplit);
        int previousMatchEnd = 0;
        while (matcher.find()) {
            int matchStart = matcher.start();
            int matchEnd = matcher.end();
            // Anything between the previous match and this one is not a dictionary word
            if (matchStart != previousMatchEnd)
                System.out.println("Not recognized: " + stringToSplit.substring(previousMatchEnd, matchStart));
            System.out.println("Match: " + stringToSplit.substring(matchStart, matchEnd));
            previousMatchEnd = matchEnd;
        }
        // Trailing text after the last match
        if (previousMatchEnd != stringToSplit.length())
            System.out.println("Not recognized: " + stringToSplit.substring(previousMatchEnd));
    }
}
Output:
Not recognized: ome
Match: code
Match: exchange
Not recognized: uther
Match: can
Match: be
Match: tree
Match: of
Match: word
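One thing to keep in mind with the alternation approach: Java's regex engine tries the alternatives from left to right and takes the first one that matches, so when one dictionary word is a prefix of another, the order of the words changes the result. The sketch below illustrates this with a made-up dictionary ("be", "bee", "tree") and a made-up input ("beetree"); `AlternationOrder`, `build` and `matches` are hypothetical names. Sorting the words longest-first is a common mitigation, but it is still a greedy scan and, unlike the DP answer, is not guaranteed to minimize the unrecognized characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationOrder {
    // Builds an alternation with longer words first, each word quoted as a literal.
    static Pattern build(List<String> words) {
        String regex = words.stream()
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(regex);
    }

    // Collects every match found by a left-to-right scan.
    static List<String> matches(Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "be" is listed before "bee", so "be" wins and the second "e" is left over.
        System.out.println(matches(Pattern.compile("be|bee|tree"), "beetree"));
        // Longest-first ordering lets "bee" match as a whole.
        System.out.println(matches(build(Arrays.asList("be", "bee", "tree")), "beetree"));
    }
}
```

The first call should print `[be, tree]` and the second `[bee, tree]`.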