Python regex to extract part of a string

Date: 2015-10-04 01:00:13

Tags: python regex python-2.7

I want to extract a part of a large string. There is a target word, and upper bounds on the number of words before and after it. The extracted substring must therefore contain the target word together with at most that many words on either side. If the target word is close to the beginning or end of the text, the before or after part may contain fewer words.

Example string:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Target word: laboris

words_before: 5

words_after: 2

It should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']

I have thought of several possible patterns, but none of them worked. I suppose it could also be done by simply walking the string forwards and backwards from the target word. However, a regular expression would surely make things easier. Any help would be appreciated.

3 Answers:

Answer 0 (score: 5)

If you split the string into words, you can use slice() together with split(). For example:

>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()

>>> n = text.index('laboris')
>>> s = slice(n - 5, n + 3)

>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
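One caveat with this slice-based approach: if the target word sits fewer than words_before positions from the start, n - 5 goes negative and the slice wraps around to the end of the list instead of being truncated. A minimal sketch that clamps the lower bound (the `window` helper is my own name, not from the answer):

```python
text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
        "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
        "ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
        "aliquip ex ea commodo consequat.")

def window(words, target, before, after):
    # index() raises ValueError if the target word is absent
    n = words.index(target)
    # max() clamps the lower bound: a negative start would make the
    # slice wrap around to the end of the list
    return words[max(0, n - before):n + after + 1]

print(' '.join(window(text.split(), 'laboris', 5, 2)))
# veniam, quis nostrud exercitation ullamco laboris nisi ut
print(' '.join(window(text.split(), 'ipsum', 5, 2)))
# Lorem ipsum dolor sit
```

Joining the tokens back with spaces also turns the list of words into the single-string result the question asks for.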

Answer 1 (score: 3)

If you still want regex....

import re

def find_context(word_, n_before, n_after, string_):
    # raw strings: \w / \W are regex escapes, not Python string escapes
    b = r'\w+\W+' * n_before
    a = r'\W+\w+' * n_after
    pattern = '(' + b + word_ + a + ')'

    print(re.search(pattern, string_).group(1))


# st holds the example string from the question
find_context('laboris', 5, 2, st)

veniam, quis nostrud exercitation ullamco laboris nisi ut

find_context('culpa', 2, 2, st)

sunt in culpa qui officia
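A caveat with this pattern: the repetition counts are fixed, so re.search returns None when fewer than n_before words precede the target (or fewer than n_after follow it). A sketch using bounded quantifiers {0,n} so the match can shrink at the edges of the text (the function name is my own, not from the answer):

```python
import re

def find_context_bounded(word, n_before, n_after, text):
    # {0,n} lets the match shrink when the target sits near the start
    # or end of the text; \b keeps the target a whole word, and
    # re.escape() guards against regex metacharacters in the target
    pattern = r'(?:\w+\W+){0,%d}\b%s\b(?:\W+\w+){0,%d}' % (
        n_before, re.escape(word), n_after)
    m = re.search(pattern, text)
    return m.group(0) if m else None

st = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
      "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
      "ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
      "aliquip ex ea commodo consequat.")
print(find_context_bounded('laboris', 5, 2, st))
# veniam, quis nostrud exercitation ullamco laboris nisi ut
print(find_context_bounded('ipsum', 5, 2, st))
# Lorem ipsum dolor sit
```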

Answer 2 (score: 2)

You can also handle this with nltk and its "concordance" method, inspired by Calling NLTK's concordance - how to get text before/after a word that was used?:


A concordance view shows us every occurrence of a given word, together with some context.

import nltk


def get_neighbors(input_text, word, before, after):
    text = nltk.Text(nltk.tokenize.word_tokenize(input_text))

    concordance_index = nltk.ConcordanceIndex(text.tokens)
    # take the offset of the first occurrence of the word
    offset = next(offset for offset in concordance_index.offsets(word))

    # the extra -1 grabs one more leading token, since word_tokenize splits
    # trailing punctuation (e.g. "veniam,") into a separate token
    return text.tokens[offset - before - 1: offset] + text.tokens[offset: offset + after + 1]

text = u"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."  
print(get_neighbors(text, 'laboris', 5, 2))

This prints 5 words/tokens before the target word and 2 after it:

[u'veniam', u',', u'quis', u'nostrud', u'exercitation', u'ullamco', u'laboris', u'nisi', u'ut']
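Because word_tokenize makes punctuation a token of its own, the comma appears as a separate list element. If you need the result back as a plain string, one sketch (the punctuation character class here is my own assumption, covering only common marks):

```python
import re

tokens = [u'veniam', u',', u'quis', u'nostrud', u'exercitation',
          u'ullamco', u'laboris', u'nisi', u'ut']
# join with spaces, then delete the space the tokenizer left before punctuation
snippet = re.sub(r'\s+([,.;:!?])', r'\1', ' '.join(tokens))
print(snippet)
# veniam, quis nostrud exercitation ullamco laboris nisi ut
```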