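Algorithm to identify all common subsequences in two strings

Time: 2017-03-05 04:23:57

Tags: java algorithm dynamic-programming string-matching

I need to identify all common subsequences of two given strings. The longest common subsequence algorithm only finds the single longest one, but here I want every common subsequence whose length meets a threshold. Is there a specific algorithm or approach for this?

Something like this: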

Julie loves me more than Linda loves me
Jane likes me more than Julie loves me

If the threshold is 2, the following are common subsequences of these two strings:

me more than
loves me

2 Answers:

Answer 0 (score: 1)

Set<String> allCS = new HashSet<>(); // create an empty set (needs java.util.Set and java.util.HashSet)
String[] subStrings = getSubSequences(string2); // all subsequences of string2 (the smaller string)
for (String str : subStrings) {
    String lcs = LCS(string1, str);
    if (lcs.length() > THRESHOLD) {
        allCS.add(lcs);
    }
}

Here, getSubSequences(String s) returns all subsequences of the given string argument, and LCS(String s1, String s2) returns the LCS of s1 and s2.

getSubSequences(String s) can be implemented with a bitmask approach or recursively.
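
As a rough illustration of the bitmask approach, here is a minimal Python sketch. The name get_subsequences is a hypothetical stand-in for getSubSequences above, not code from this answer. Each integer mask from 1 to 2^n - 1 selects the characters whose bit is set.

```python
def get_subsequences(s):
    """Return all non-empty subsequences of s (hypothetical helper)."""
    n = len(s)
    result = []
    for mask in range(1, 1 << n):
        # keep the character at position i when bit i of mask is set
        result.append("".join(s[i] for i in range(n) if mask & (1 << i)))
    return result
```

The list may contain duplicates when s has repeated characters; wrapping the result in a set removes them.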

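LCS(String s1, String s2) can be implemented with the O(n^2) dynamic programming approach, then tracing the path backward through the DP table to recover the longest common subsequence string.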
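
The DP-plus-backtracking idea can be sketched as follows in Python. This is the standard textbook formulation, not code from this answer, and the function name lcs is an assumption. When several subsequences share the maximum length, it returns just one of them.

```python
def lcs(s1, s2):
    """Return one longest common subsequence of s1 and s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = length of the LCS of s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # walk back from the bottom-right corner to recover one LCS
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```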

This will not be efficient if the smaller string is very long, since it can have up to 2^length(string) - 1 subsequences.
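
To give a feel for that growth, this small Python sketch counts the non-empty index selections for a string of length n. The helper name subsequence_count is hypothetical, and the number of distinct strings can be lower when characters repeat.

```python
def subsequence_count(n):
    """Number of non-empty subsequences (by index choice) of a length-n string."""
    return 2 ** n - 1

for n in (10, 20, 40):
    # each extra character doubles the number of candidates
    print(n, subsequence_count(n))
```

Each candidate also costs one LCS computation in the loop above, so the approach is only practical for very short inputs.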

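Answer 1 (score: 0)

Since this is an algorithm question, I don't think the language matters. My approach is to generate all common subsequences of the two strings and keep the ones that meet the threshold.

Python code (Java shouldn't be much harder):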

Finding all common_subsequences:

def common_subsequences(a, b, threshold):
    # tokenize both strings (order is preserved)
    tokens_a = a.split()
    tokens_b = b.split()
    # store all common subsequences
    common = set()
    # with each token in a
    for i, token in enumerate(tokens_a):
        # if it also appears in b
        # then this should be a starting point for a common subsequence
        if token in tokens_b:
            # get the first occurrence of token in b
            # and start from there (later occurrences in b are not tried)
            j = tokens_b.index(token)
            k = i
            temp = token
            # since we need all subsequences, we collect length-1 subsequences too
            common.add(temp)
            # keep extending while the next tokens also match
            while j < len(tokens_b) and k < len(tokens_a):
                if j + 1 < len(tokens_b) and k + 1 < len(tokens_a) and tokens_b[j+1] == tokens_a[k+1]:
                    temp += " " + tokens_b[j+1]
                    j += 1
                    k += 1
                    # adding (new) common subsequences
                    common.add(temp)
                # or else we break
                else:
                    break
    # we only get the ones having length >= threshold
    return [s for s in common if len(s.split()) >= threshold]

a = "Julie loves me more than Linda loves me"
b = "Jane likes me more than Julie loves me"
print(common_subsequences(a, b, 2))

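All common_subsequences (the contents of the common set before the >= threshold filter is applied; with a threshold of 2 the single-token entries are dropped):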

set(['me', 'more than', 'Julie', 'Julie loves', 'Julie loves me', 'me more', 'loves', 'more', 'than', 'me more than', 'loves me'])
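
The function above starts only from the first occurrence of each token in b (tokens_b.index). As a hedged alternative, the sketch below tries every alignment by applying the standard longest-common-substring DP to word tokens. The name common_runs is hypothetical, and it assumes threshold is at least 1.

```python
def common_runs(a, b, threshold):
    """Return all common runs of consecutive words with >= threshold tokens."""
    ta, tb = a.split(), b.split()
    # run[i][j] = length of the common token run ending at ta[i-1] and tb[j-1]
    run = [[0] * (len(tb) + 1) for _ in range(len(ta) + 1)]
    found = set()
    for i in range(1, len(ta) + 1):
        for j in range(1, len(tb) + 1):
            if ta[i - 1] == tb[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
                # every suffix of this run with enough tokens qualifies
                for length in range(threshold, run[i][j] + 1):
                    found.add(" ".join(ta[i - length:i]))
    return found

a = "Julie loves me more than Linda loves me"
b = "Jane likes me more than Julie loves me"
print(sorted(common_runs(a, b, 2)))
```

For the two example sentences this yields the same multi-word entries as the set above; the results would differ only when a longer match starts at a later occurrence of a token in b.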