
Time: 2014-03-29 02:15:50

Tags: python string nlp substring longest-substring


s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)


foo bar


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

This prints the following, which is NOT what I want, because it cuts into the word kappa from s2:

a foo bar


foo bar

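I have also tried an ngram approach to get the longest common substring, but is there another way to handle the strings without computing ngrams? (see answer)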

9 answers:

Answer 0: (score: 8)

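This is simple enough. I used your code to do 75% of the job. I first split each sentence into words, then pass the word lists to your function to get the longest common substring (in this case, the longest run of consecutive words). Your function then gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.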



def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))

s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'


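  1. '.' and '?' are also treated as valid words if there is a space between the last word and the punctuation mark. If you do not leave a space, they are counted as part of the last word. In that case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide how to handle such characters before calling this kind of function. If you want them stripped out: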

    import re
    s1 = re.sub('[.?]','', s1)
    s2 = re.sub('[.?]','', s2)

  2. Then proceed as usual.
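The two steps above (strip punctuation, then compare word lists) can be combined into one function. The following is only a sketch under a few assumptions: it is written for Python 3 (`range` instead of `xrange`), the name `longest_common_word_run` is made up for illustration, and only '.' and '?' are treated as punctuation, as in the `re.sub` calls above.

```python
import re

def longest_common_word_run(s1, s2):
    """Longest run of consecutive whole words shared by s1 and s2 (a sketch)."""
    # Assumption: drop '.' and '?' as in the re.sub calls above,
    # then split on whitespace so we compare words, not characters.
    w1 = re.sub('[.?]', '', s1).split()
    w2 = re.sub('[.?]', '', s2).split()
    # m[x][y] = length of the common run ending at w1[x-1] and w2[y-1]
    m = [[0] * (1 + len(w2)) for _ in range(1 + len(w1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(w1)):
        for y in range(1, 1 + len(w2)):
            if w1[x - 1] == w2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
            # cells for non-matching words keep their initial value of 0
    return ' '.join(w1[x_longest - longest:x_longest])

print(longest_common_word_run('this is a foo bar sentence.',
                              'what a kappa foo bar black sheep?'))
```

Because the comparison is done on whole words, a match can never start or end in the middle of a word, with or without a space before the final punctuation mark.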

Answer 1: (score: 1)


In [1]: s1 = "this is a foo bar sentence ."

In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"

In [4]: s3 = "what a kappa foo bar black sheep ?"

In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '

In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '



answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer

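I am sure there are other edge cases, such as a substring that appears at the end of the string, whether to call the function recursively with s1 or s2, whether to trim the front or the back of answer, and others, but at least in the cases you show, this simple modification does what you want: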

In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '


Answer 2: (score: 1)


s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    # ... same as above ...
    # check end of match in s1
    # ... your code is a bit hard for me to follow, what is end of match? ...
    # check end of match in s2
    # ... same as above ...
    return True

print longest_common_substring(s1, s2)
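One possible shape for the boundary check, as a sketch only and not the answer author's code: the helper below takes explicit start and end indices instead of guessing them from x and y. The name `is_word_aligned` and its signature are assumptions made for illustration. Written for Python 3.

```python
def is_word_aligned(s, start, end):
    """True if s[start:end] begins and ends on a word boundary of s."""
    if start >= end:
        return False
    # the match itself must not begin or end with whitespace
    if s[start].isspace() or s[end - 1].isspace():
        return False
    # the character before the match must be whitespace or the start of s
    if start > 0 and not s[start - 1].isspace():
        return False
    # the character after the match must be whitespace or the end of s
    if end < len(s) and not s[end].isspace():
        return False
    return True

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"

# "foo bar" sits on word boundaries in s1 ...
i = s1.find("foo bar")
print(is_word_aligned(s1, i, i + len("foo bar")))

# ... but "a foo bar" starts inside "kappa" in s2
j = s2.find("a foo bar")
print(is_word_aligned(s2, j, j + len("a foo bar")))
```

In the dynamic-programming loop, a match of length n = m[x][y] that ends at x and y covers s1[x - n:x] and s2[y - n:y], so those are the index pairs to pass in. Note that the longest run ending at a given position may start mid-word (as "a foo bar" does in s2) while a shorter suffix of it ("foo bar") is aligned, so rejecting the full run is not enough on its own; shorter suffixes would need to be tried as well.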

Answer 3: (score: 1)


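  1. The trivial case: the whole string, no boundaries crossed (your first example)
  2. A word boundary is crossed at the beginning (your second example)
  3. A word boundary is crossed at the end
  4. A word boundary is crossed at each end

Your code already handles the trivial case, so we can take advantage of that; all that is left is to wrap its result in a few checks for the other cases. So what should these checks look like? Let's take your failing case: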

    string 1 = "this is a foo bar sentence ."
    string 2 = "what a kappa foo bar black sheep ?"
    output string = "a foo bar"



    def full_string(str1, str2, chkstr):
      l1 = str1.split()
      l2 = str2.split()
      chkl = chkstr.split()
      return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
              any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))

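    Using this function, we can check whether both strings contain, in order, all the words in the result of longest_common_substring(s1, s2). Perfect. So the last step is to combine the two functions and check each of the 4 cases listed above: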

    def longest_whole_substring(s1, s2):
      subs = longest_common_substring(s1, s2)
      if not full_string(s1, s2, subs):
        if full_string(s1, s2, ' '.join(subs.split()[1:])):
          subs = ' '.join(subs.split()[1:])
        elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
          subs = ' '.join(subs.split()[:-1])
        else:
          subs = ' '.join(subs.split()[1:-1])
      return subs

    Now the function longest_whole_substring(s1, s2) returns the longest whole substring without cutting off any words. Let's test it on each of the cases:


    >>> a = 'this is a foo bar bar foo string'
    >>> b = 'foo bar'
    >>> longest_whole_substring(a,b)
    'foo bar'


    >>> b = 's a foo bar'
    >>> longest_whole_substring(a,b)
    'a foo bar'


    >>> b = 'foo bar f'
    >>> longest_whole_substring(a,b)
    'foo bar '


    >>> b = 's a foo bar f'
    >>> longest_whole_substring(a,b)
    'a foo bar'


Answer 4: (score: 1)




def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    # current character in s1
    x_char = s1[x - 1]
    # we are at the beginning of a word in s1 if
    #   (we are at the beginning of s1) or 
    #   (previous character is a space)
    x_word_begin = (x == 1) or (s1[x - 2] == " ")
    # we are at the end of a word in s1 if
    #   (we are at the end of s1) or 
    #   (next character is a space)
    x_word_end = (x == len(s1)) or (s1[x] == " ")
    for y in xrange(1, 1 + len(s2)):
      # current character in s2
      y_char = s2[y - 1]
      # we are at the beginning of a word in s2 if
      #   (we are at the beginning of s2) or 
      #   (previous character is a space)
      y_word_begin = (y == 1) or (s2[y - 2] == " ")
      # we are at the end of a word in s2 if
      #   (we are at the end of s2) or 
      #   (next character is a space)
      y_word_end = (y == len(s2)) or (s2[y] == " ")
      if x_char == y_char:
        # no match starting with x_char
        if m[x - 1][y - 1] == 0:
          # a match can start only with a space
          #   or at the beginning of a word
          if x_char == " " or (x_word_begin and y_word_begin):
              m[x][y] = m[x - 1][y - 1] + 1
        else:
          m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # the match can end only with a space
          #   or at the end of a word
          if x_char == " " or (x_word_end and y_word_end):
            longest = m[x][y]
            x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

Answer 5: (score: 1)


def common_phrase(longer, shorter):
    """ recursively find longest common substring, consists of whole words only and in the same order """
    # note: `in` is a plain substring test, so a phrase from `shorter`
    # can still match inside a longer word of `longer`
    if shorter in longer:
        return shorter
    elif len(shorter.split()) > 1:
        # drop one word from either end of `shorter` and keep the longer result
        common_phrase_without_last_word = common_phrase(longer, shorter.rsplit(' ', 1)[0])
        common_phrase_without_first_word = common_phrase(longer, shorter.split(' ', 1)[1])
        without_first_is_longer = len(common_phrase_without_last_word) < len(common_phrase_without_first_word)

        return ((not without_first_is_longer) * common_phrase_without_last_word +
                without_first_is_longer * common_phrase_without_first_word)
    else:
        return ''


# decide which input plays which role before calling common_phrase
if len(str1) > len(str2):
    longer, shorter = str1, str2
else:
    longer, shorter = str2, str1

Answer 6: (score: 0)


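from itertools import chain

def ngrams(text, n):
  # + 1 so that the last n-gram is included
  return [text[i:i+n] for i in xrange(len(text) - n + 1)]

def longest_common_ngram(s1, s2):
  # all word n-grams of every length, from 1 up to the whole sentence
  s1ngrams = list(chain(*[[" ".join(j) for j in ngrams(s1.split(), i)]
                          for i in range(1, len(s1.split()) + 1)]))
  s2ngrams = list(chain(*[[" ".join(j) for j in ngrams(s2.split(), i)]
                          for i in range(1, len(s2.split()) + 1)]))

  # note: this returns every shared n-gram, not only the longest one
  return set(s1ngrams).intersection(set(s2ngrams))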
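The function above returns the set of all shared n-grams. To get only the longest ones, one option is to try the largest n first and stop at the first size that has a match. This is a sketch for Python 3; `word_ngrams` and `longest_common_ngrams` are hypothetical names, not part of the answer.

```python
def word_ngrams(words, n):
    # all runs of n consecutive words, joined back into strings
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def longest_common_ngrams(s1, s2):
    w1, w2 = s1.split(), s2.split()
    # try the largest n first and stop at the first size with a shared n-gram
    for n in range(min(len(w1), len(w2)), 0, -1):
        common = set(word_ngrams(w1, n)) & set(word_ngrams(w2, n))
        if common:
            return common
    return set()

print(longest_common_ngrams("this is a foo bar sentence .",
                            "what a kappa foo bar black sheep ?"))
```

A set is returned because several different n-grams of the same maximal length can be shared.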

Answer 7: (score: 0)


For a list of Python suffix tree implementations, see the accepted answer to "python: library for generalized suffix trees".

Answer 8: (score: 0)

from difflib import SequenceMatcher
def longest_substring(str1, str2):
    # initialize SequenceMatcher object with
    # input string
    # below logic is to make sure word does not get cut
    str1 = " " + str1.strip() + " "
    str2 = " " + str2.strip() + " "
    seq_match = SequenceMatcher(None, str1, str2)

    # find match of longest sub-string
    # output will be like Match(a=0, b=0, size=5)
    match = seq_match.find_longest_match(0, len(str1), 0, len(str2))

    # return longest substring
    if match.size != 0:
        lm = str1[match.a: match.a + match.size]
        # below logic is to make sure word does not get cut
        if not lm.startswith(" "):
            while not (lm.startswith(" ") or len(lm) == 0):
                lm = lm[1:]
        if not lm.endswith(" "):
            while not (lm.endswith(" ") or len(lm) == 0):
                lm = lm[:-1]
        return lm.strip()
    else:
        return ""
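Since `SequenceMatcher` accepts any sequences of hashable items, a variation on the function above is to pass it lists of words instead of strings, so that no trimming of partial words is needed afterwards. This is a sketch of that variation, not the answer author's code; `longest_common_words` is a name chosen for illustration.

```python
from difflib import SequenceMatcher

def longest_common_words(str1, str2):
    a, b = str1.split(), str2.split()
    # autojunk=False switches off the heuristic that can ignore
    # very frequent items when the second sequence is long
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # match.size is 0 when nothing is shared, which yields ''
    return ' '.join(a[match.a:match.a + match.size])

print(longest_common_words("this is a foo bar sentence .",
                           "what a kappa foo bar black sheep ?"))
```

The bounds are passed to `find_longest_match` explicitly, as in the function above, because they only became optional in newer Python versions.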