Longest common substring without cutting a word - python

Time: 2014-03-29 02:15:50

Tags: python string nlp substring longest-substring

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)

[OUT]:

foo bar

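But how can I make sure that the longest common substring respects English word boundaries and does not cut up a word? For example, with the following sentences: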

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

outputs the following, which is NOT desired, because it cuts up the word kappa from s2:

a foo bar

The desired output is still:

foo bar

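I have also tried an ngram approach to get the longest common substring, but is there another way to process the strings without computing ngrams? (see answer)

9 answers:

Answer 0: (score: 8)

This is too simple to understand. I used your code to do 75% of the job. I first split each sentence into words and then pass the word lists to your function to get the largest common substring (in this case, the longest run of consecutive words). Your function gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.

Here is a working copy online for you to test, verify and fiddle with.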

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'

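Edge cases

'.' and '?' are also treated as valid words if there is a space between the last word and the punctuation mark. If you leave no space, they are counted as part of the last word; in that case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide what to do with such characters before calling this function. For example, you could strip them first:

    import re
    s1 = re.sub('[.?]','', s1)
    s2 = re.sub('[.?]','', s2)

and then continue as usual.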
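The split-then-match idea can also be written as one self-contained function. The sketch below is a Python 3 port, not the answer's exact code. The names `tokenize` and `longest_common_word_run` are made up for illustration, and using `\w+` to drop punctuation is an assumption about how you want to tokenize.

```python
import re

def tokenize(sentence):
    # Hypothetical helper: keep runs of word characters and drop punctuation.
    return re.findall(r"\w+", sentence)

def longest_common_word_run(s1, s2):
    """Longest run of consecutive words shared by s1 and s2."""
    a, b = tokenize(s1), tokenize(s2)
    # m[x][y] = length of the common run of words ending at a[x-1] and b[y-1]
    m = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(a) + 1):
        for y in range(1, len(b) + 1):
            if a[x - 1] == b[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
    return " ".join(a[x_longest - longest:x_longest])

print(longest_common_word_run("this is a foo bar sentence.",
                              "what a kappa foo bar black sheep?"))
# foo bar
```

Because whole words are compared, a match can never begin or end inside a word, and punctuation glued to the last word no longer matters.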

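Answer 1: (score: 1)

My answer does not draw on any official source; it is just a simple observation. At least in my installation, the output of your LCS function differs between the pair (s1, s2) and the pair (s1, s3):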

In [1]: s1 = "this is a foo bar sentence ."

In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"

In [4]: s3 = "what a kappa foo bar black sheep ?"

In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '

In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '

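As you may notice, if complete words are matched, then the surrounding whitespace is matched as well.

You can then modify the function before its output is returned, like so: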

answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer

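I'm sure there are other edge cases, such as the substring appearing at the end of the string, whether to call the function recursively with s1 or with s2, whether to trim the answer at the front or at the back, and others. But at least in the cases you show, this simple modification does what you want: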

In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '

Do you think this direction is worth exploring?
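For anyone who wants to try this direction, here is a minimal Python 3 sketch of the modification as a separate wrapper. The name `space_bounded_substring` is made up. The empty-string guard is an assumption added here, since without it the recursion never stops when no space-delimited match exists.

```python
def longest_common_substring(s1, s2):
    # Character-level dynamic programme from the question, ported to Python 3.
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
    return s1[x_longest - longest:x_longest]

def space_bounded_substring(s1, s2):
    answer = longest_common_substring(s1, s2)
    # Assumed guard (not in the answer): stop once nothing is left to trim.
    if not answer:
        return ""
    if answer.startswith(" ") and answer.endswith(" "):
        return answer
    # Drop the first character and search again inside the shortened match.
    return space_bounded_substring(s1, answer[1:])

s1 = "this is a foo bar sentence ."
s3 = "what a kappa foo bar black sheep ?"
print(repr(space_bounded_substring(s1, s3)))
# ' foo bar '
```

It only ever trims from the front, so the edge cases listed above (a match at the very end of a string, a word cut at the back) remain unhandled.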

Answer 2: (score: 1)

Just add an acceptance condition to your code:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me follow, what is end of match? ...
    # check end of match in s2
    ... same as above ...
    return True

print longest_common_substring(s1, s2)
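The elided checks above could be filled in along these lines. This is only one possible completion in Python 3, not the answer's own code. The helpers `starts_word` and `ends_word` and the function name are hypothetical. The inner loop that shrinks a candidate from the left is an addition, needed because m[x][y] is the length of the longest run ending at (x, y) and that run may start in the middle of a word.

```python
def starts_word(s, i):
    # s[i] is a non-space character at the start of s or right after a space.
    return not s[i].isspace() and (i == 0 or s[i - 1].isspace())

def ends_word(s, i):
    # s[i] is a non-space character at the end of s or right before a space.
    return not s[i].isspace() and (i == len(s) - 1 or s[i + 1].isspace())

def longest_word_aligned_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                # Acceptance condition, part 1: the match must end a word in both strings.
                if not (ends_word(s1, x - 1) and ends_word(s2, y - 1)):
                    continue
                # Part 2 (assumed addition): try the longest candidate first and
                # shrink from the left until the start is word-aligned in both strings.
                for length in range(m[x][y], longest, -1):
                    if starts_word(s1, x - length) and starts_word(s2, y - length):
                        longest, x_longest = length, x
                        break
    return s1[x_longest - longest:x_longest]

print(longest_word_aligned_substring("this is a foo bar sentence .",
                                     "what a kappa foo bar black sheep ?"))
# foo bar
```

The extra loop makes the worst case cubic in the string length, which is fine for sentences but worth knowing for long inputs.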

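Answer 3: (score: 1)

This is more of an interesting problem than I initially gave it credit for. When you think about it, there are 4 possible outcomes.

  1. The trivial case: the whole string is matched with no boundary problems (your first example)
  2. A word boundary is crossed at the beginning (your second example)
  3. A word boundary is crossed at the end
  4. A word boundary is crossed at each end

Now your code handles the trivial case, so we can take advantage of that; all that is left is to wrap your result in a few checks for the other cases. So what should these checks look like? Let's take your failure case: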

    string 1 = "this is a foo bar sentence ."
    string 2 = "what a kappa foo bar black sheep ?"
    output string = "a foo bar"
    

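So from a string-find perspective, we can find all of those letters in order in both string1 and string2, but if we split everything on spaces into lists and look for the lists in order, only string1 will match.

Now I'm primarily a C guy, so I want to write this as a function: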

    def full_string(str1, str2, chkstr):
      l1 = str1.split()
      l2 = str2.split()
      chkl = chkstr.split()
      return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
              any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))
    

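With this function we can check whether both strings contain, in order, all the words in the result of longest_common_substring(s1, s2). Perfect. So the last step is to combine these two functions and check each of the 4 cases listed above: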

    def longest_whole_substring(s1, s2):
      subs = longest_common_substring(s1, s2)
      if not full_string(s1, s2, subs):
        if full_string(s1, s2, ' '.join(subs.split()[1:])):
          subs = ' '.join(subs.split()[1:])
        elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
          subs = ' '.join(subs.split()[:-1])
        else:
          subs = ' '.join(subs.split()[1:-1])
      return subs
    

Now the function longest_whole_substring(s1, s2) will give the longest whole substring without cutting off any words. Let's test it on each of the cases:

Trivial:

    >>> a = 'this is a foo bar bar foo string'
    >>> b = 'foo bar'
    >>> 
    >>> longest_whole_substring(a,b)
    'foo bar'
    

Word boundary at the beginning:

    >>> b = 's a foo bar'
    >>> 
    >>> longest_whole_substring(a,b)
    'a foo bar '
    

Word boundary at the end:

    >>> b = 'foo bar f'
    >>> 
    >>> longest_whole_substring(a,b)
    'foo bar'
    

Word boundary at both ends:

    >>> b = 's a foo bar f'
    >>> 
    >>> longest_whole_substring(a,b)
    'a foo bar'
    

Looking good!

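Answer 4: (score: 1)

All you need to do is add checks for the beginning and the end of a word.

Then m only starts counting at a valid match start, and longest is only updated at a valid match end.

Like this: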

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    # current character in s1
    x_char = s1[x - 1]
    # we are at the beginning of a word in s1 if
    #   (we are at the beginning of s1) or 
    #   (previous character is a space)
    x_word_begin = (x == 1) or (s1[x - 2] == " ")
    # we are at the end of a word in s1 if
    #   (we are at the end of s1) or 
    #   (next character is a space)
    x_word_end = (x == len(s1)) or (s1[x] == " ")
    for y in xrange(1, 1 + len(s2)):
      # current character in s2
      y_char = s2[y - 1]
      # we are at the beginning of a word in s2 if
      #   (we are at the beginning of s2) or 
      #   (previous character is a space)
      y_word_begin = (y == 1) or (s2[y - 2] == " ")
      # we are at the end of a word in s2 if
      #   (we are at the end of s2) or 
      #   (next character is a space)
      y_word_end = (y == len(s2)) or (s2[y] == " ")
      if x_char == y_char:
        # no match starting with x_char
        if m[x - 1][y - 1] == 0:
          # a match can start only with a space
          #   or at the beginning of a word
          if x_char == " " or (x_word_begin and y_word_begin):
              m[x][y] = m[x - 1][y - 1] + 1
        else:
          m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # the match can end only with a space
          #   or at the end of a word
          if x_char == " " or (x_word_end and y_word_end):
            longest = m[x][y]
            x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

Answer 5: (score: 1)

I did it recursively:

def common_phrase(longer, shorter):
    """Recursively find the longest common substring that consists of whole words only, in the same order."""
    # note: `in` is a plain substring test, so a match may still fall inside a longer word
    if shorter in longer:
        return shorter
    elif len(shorter.split()) > 1:
        # recurse on the shorter string with its last / first word removed
        common_phrase_without_last_word = common_phrase(longer, shorter.rsplit(' ', 1)[0])
        common_phrase_without_first_word = common_phrase(longer, shorter.split(' ', 1)[1])
        without_first_is_longer = len(common_phrase_without_last_word) < len(common_phrase_without_first_word)

        # multiplying a string by a bool keeps it (True) or empties it (False)
        return ((not without_first_is_longer) * common_phrase_without_last_word +
                without_first_is_longer * common_phrase_without_first_word)
    else:
        return ''

Just classify the two strings as 'shorter' and 'longer' before applying it:

if len(str1) > len(str2):
    longer, shorter = str1, str2 
else:
    longer, shorter = str2, str1
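Here is one way the recursive idea might be packaged, as a Python 3 sketch rather than the answer's exact code. Two things are assumptions added here: both strings are padded with spaces so that the `in` test can only match whole words, and `functools.lru_cache` memoizes the recursion so the same word range is not searched repeatedly. The name `common_phrase_whole_words` is made up.

```python
from functools import lru_cache

def common_phrase_whole_words(str1, str2):
    # Classify the inputs as longer / shorter, as described above.
    longer, shorter = (str1, str2) if len(str1) > len(str2) else (str2, str1)
    # Assumed refinement: pad with spaces so only whole words can match.
    padded = " " + " ".join(longer.split()) + " "

    @lru_cache(maxsize=None)
    def search(words):
        # words is a tuple of consecutive words taken from the shorter string
        if not words:
            return ""
        phrase = " ".join(words)
        if " " + phrase + " " in padded:
            return phrase
        without_last = search(words[:-1])
        without_first = search(words[1:])
        return without_first if len(without_first) > len(without_last) else without_last

    return search(tuple(shorter.split()))

print(common_phrase_whole_words("this is a foo bar sentence .",
                                "what a kappa foo bar black sheep ?"))
# foo bar
```

Without the padding, the plain substring test would accept "a foo bar" here, because that text occurs inside "kappa foo bar".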

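Answer 6: (score: 0)

Here is an ngram way:

from itertools import chain

def ngrams(text, n):
  # the + 1 keeps the last n-gram of the sequence
  return [text[i:i+n] for i in xrange(len(text)-n+1)]

def longest_common_ngram(s1, s2):
  # every word n-gram, from single words up to the whole sentence
  s1ngrams = list(chain(*[[" ".join(j) for j in ngrams(s1.split(), i)] 
                          for i in range(1, len(s1.split())+1)]))
  s2ngrams = list(chain(*[[" ".join(j) for j in ngrams(s2.split(), i)]
                          for i in range(1, len(s2.split())+1)]))

  # note: this is the set of all common n-grams, not only the longest one
  return set(s1ngrams).intersection(set(s2ngrams))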
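The function above returns every n-gram the two sentences share. If only the longest one is wanted, a variant like the following Python 3 sketch could try the largest n first and stop at the first size that has a shared n-gram. The names `word_ngrams` and `longest_common_ngrams` are made up here, and returning a sorted list of ties is a design assumption.

```python
def word_ngrams(words, n):
    # All runs of n consecutive words, as hashable tuples.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def longest_common_ngrams(s1, s2):
    w1, w2 = s1.split(), s2.split()
    # Start from the largest possible n and stop at the first size with a hit.
    for n in range(min(len(w1), len(w2)), 0, -1):
        common = set(word_ngrams(w1, n)) & set(word_ngrams(w2, n))
        if common:
            return sorted(" ".join(gram) for gram in common)
    return []

print(longest_common_ngrams("this is a foo bar sentence .",
                            "what a kappa foo bar black sheep ?"))
# ['foo bar']
```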

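Answer 7: (score: 0)

One efficient way to find longest common substrings is a suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree and http://en.wikipedia.org/wiki/Longest_common_substring_problem). I don't see any reason you couldn't build a suffix tree using words instead of characters, in which case the longest common substring extracted from the tree would respect token boundaries. This approach would be particularly efficient if you want to find common substrings between one fixed string and a large number of other strings.

For a list of Python suffix tree implementations, see the accepted answer to "python: library for generalized suffix trees".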
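To illustrate the word-level idea without pulling in a suffix tree library, here is a Python 3 sketch that uses a naive sorted list of word suffixes instead. This is a simpler stand-in and not a suffix tree: it copies and sorts every suffix, so it has none of the efficiency claimed above. The function name is made up.

```python
def longest_common_word_run(s1, s2):
    w1, w2 = s1.split(), s2.split()
    # Every word-level suffix, tagged with the string it came from (0 or 1).
    suffixes = ([(w1[i:], 0) for i in range(len(w1))] +
                [(w2[j:], 1) for j in range(len(w2))])
    suffixes.sort()
    best = []
    # The longest shared prefix between suffixes of different strings
    # always shows up between two neighbours in sorted order.
    for (a, src_a), (b, src_b) in zip(suffixes, suffixes[1:]):
        if src_a == src_b:
            continue
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return " ".join(best)

print(longest_common_word_run("this is a foo bar sentence .",
                              "what a kappa foo bar black sheep ?"))
# foo bar
```

Because the units are words, whatever comes out respects token boundaries. A real generalized suffix tree would give the same answer with far better scaling when one string is compared against many.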

Answer 8: (score: 0)

from difflib import SequenceMatcher
def longest_substring(str1, str2):
    # initialize SequenceMatcher object with
    # input string
    # below logic is to make sure word does not get cut
    str1 = " " + str1.strip() + " "
    str2 = " " + str2.strip() + " "
    seq_match = SequenceMatcher(None, str1, str2)

    # find match of longest sub-string
    # output will be like Match(a=0, b=0, size=5)
    match = seq_match.find_longest_match(0, len(str1), 0, len(str2))

    # return longest substring
    if match.size != 0:
        lm = str1[match.a: match.a + match.size]
        # below logic is to make sure word does not get cut
        if not lm.startswith(" "):
            while not (lm.startswith(" ") or len(lm) == 0):
                lm = lm[1:]
        if not lm.endswith(" "):
            while not (lm.endswith(" ") or len(lm) == 0):
                lm = lm[:-1]
        return lm.strip()
    else:
        return ""
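A possible simplification, offered as a sketch and not as part of the answer: `SequenceMatcher` accepts any sequence of hashable items, so it can be given lists of words instead of strings. The trimming loops then become unnecessary, because a match can only consist of whole words. The function name is made up, and `autojunk=False` is passed only to switch off difflib's popularity heuristic for long inputs.

```python
from difflib import SequenceMatcher

def longest_common_words(str1, str2):
    words1, words2 = str1.split(), str2.split()
    # Compare word lists, so the matching unit is a whole word.
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    match = matcher.find_longest_match(0, len(words1), 0, len(words2))
    # match.size is 0 when nothing is shared, which gives an empty string.
    return " ".join(words1[match.a:match.a + match.size])

print(longest_common_words("this is a foo bar sentence .",
                           "what a kappa foo bar black sheep ?"))
# foo bar
```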