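I need to identify all the common subsequences of two given strings. Longest common subsequence finds only the longest one, but here I want every common subsequence whose length exceeds a threshold. Is there a specific algorithm or approach for this?
Something like this: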
Julie loves me more than Linda loves me
Jane likes me more than Julie loves me
If the threshold is 2, the following are common subsequences of these two strings:
me more than
loves me
Answer 0: (score: 1)
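// needs: import java.util.HashSet; import java.util.Set;
Set<String> allCS = new HashSet<>(); // create an empty set
// find all subsequences of string2 (the smaller string)
String[] subStrings = getSubSequences(string2);
for (String str : subStrings) {
    // longest subsequence shared by string1 and this candidate
    String lcs = LCS(string1, str);
    if (lcs.length() > THRESHOLD) {
        allCS.add(lcs);
    }
}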
Here, getSubSequences(String s) returns all subsequences of the given string argument, and LCS(String s1, String s2) returns the LCS of s1 and s2.
getSubSequences(String s) can be implemented with a bitmask approach or recursively.
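As a rough illustration of the bitmask approach, here is a minimal sketch in Python. The name get_subsequences is hypothetical, and the sketch assumes the input is already split into word tokens, as in the question's example; passing a list of characters works the same way.

```python
def get_subsequences(tokens):
    """Return every non-empty subsequence of tokens, one per bitmask."""
    n = len(tokens)
    result = []
    # Masks 1 .. 2**n - 1: bit i set means tokens[i] is kept.
    for mask in range(1, 1 << n):
        result.append([tokens[i] for i in range(n) if mask & (1 << i)])
    return result


subs = get_subsequences("loves me more".split())
print(len(subs))  # 2**3 - 1 subsequences
```

Each mask selects a different subset of positions, and the relative order of the kept tokens never changes, which is what makes each result a subsequence.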
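LCS(String s1, String s2) can be implemented with the O(n^2) dynamic programming approach, followed by tracing the path backward through the DP table to recover the longest common subsequence string.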
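The DP-plus-backtracking idea can be sketched as follows, again in Python. The function name lcs is a stand-in for the answer's LCS helper, and the sketch assumes the inputs are sequences of comparable items (word lists or plain strings).

```python
def lcs(seq1, seq2):
    """Return one longest common subsequence of seq1 and seq2 as a list."""
    n, m = len(seq1), len(seq2)
    # dp[i][j] = LCS length of seq1[:i] and seq2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if seq1[i - 1] == seq2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back from the bottom-right corner to collect the matched items.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if seq1[i - 1] == seq2[j - 1]:
            out.append(seq1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    out.reverse()
    return out


a = "Julie loves me more than Linda loves me".split()
b = "Jane likes me more than Julie loves me".split()
print(" ".join(lcs(a, b)))
```

When several subsequences tie for the longest length, the backtracking step returns only one of them.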
This will not be efficient if the smaller string is very long, because it can have 2^length(string) - 1 subsequences.
Answer 1: (score: 0)
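Since this is an algorithm question, I don't think the language matters. My approach is to generate all the common subsequences of the two strings and keep the ones that meet the threshold.
Python code (it shouldn't be any harder in Java):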
The common_subsequences function:
def common_subsequences(a, b, threshold):
    # tokenize both strings on whitespace (word order is kept)
    tokens_a = a.split()
    tokens_b = b.split()
    # store all common subsequences (runs of consecutive words)
    common = set()
    # for each token in a
    for i, token in enumerate(tokens_a):
        # every occurrence of the token in b is a possible starting point
        # (using only the first one, tokens_b.index(token), can miss runs
        # that start at a later occurrence)
        for start in range(len(tokens_b)):
            if tokens_b[start] != token:
                continue
            j = start
            k = i
            temp = token
            # since we need all subsequences, we record length-1 ones too
            common.add(temp)
            # extend the run while the next tokens still match
            while (j + 1 < len(tokens_b) and k + 1 < len(tokens_a)
                   and tokens_b[j + 1] == tokens_a[k + 1]):
                temp += " " + tokens_b[j + 1]
                j += 1
                k += 1
                # record each (new) common subsequence
                common.add(temp)
    # keep only the ones with at least `threshold` words
    return [s for s in common if len(s.split()) >= threshold]

a = "Julie loves me more than Linda loves me"
b = "Jane likes me more than Julie loves me"
print(common_subsequences(a, b, 2))
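All common subsequences found, before the >= threshold filter is applied:
set(['me', 'more than', 'Julie', 'Julie loves', 'Julie loves me', 'me more', 'loves', 'more', 'than', 'me more than', 'loves me'])
With threshold = 2, only the entries of two or more words are returned (order may vary):
['more than', 'Julie loves', 'Julie loves me', 'me more', 'me more than', 'loves me']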
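The same result can also be reached with a dynamic programming table, which avoids rescanning b for every token. This is an alternative sketch, not the answer's own method: common_word_runs is a hypothetical name, and it assumes threshold >= 1 and that "subsequence" here means a run of consecutive words, as in the output above.

```python
def common_word_runs(a, b, threshold):
    """Return the set of common runs of consecutive words with >= threshold words."""
    ta, tb = a.split(), b.split()
    # run[i][j] = length of the common run of words ending at ta[i-1] and tb[j-1]
    run = [[0] * (len(tb) + 1) for _ in range(len(ta) + 1)]
    found = set()
    for i in range(1, len(ta) + 1):
        for j in range(1, len(tb) + 1):
            if ta[i - 1] == tb[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                # every suffix of this run that is long enough qualifies
                for length in range(threshold, run[i][j] + 1):
                    found.add(" ".join(ta[i - length:i]))
    return found


a = "Julie loves me more than Linda loves me"
b = "Jane likes me more than Julie loves me"
print(sorted(common_word_runs(a, b, 2)))
```

Filling the table takes time proportional to the product of the two word counts, plus the cost of building the matching strings.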