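Grab the words before and after a matched term

Time: 2017-05-16 02:01:43

Tags: python regex

I'm taking my first steps with Python, and I have a problem to solve that needs a regular expression.

I'm parsing several lines of text, and I need to grab the 5 words before and after a certain match. The term to match is always the same, and a line can contain more than one occurrence of it.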

r"(?i)((?:\S+\s+){0,5})<tag>(\w*)</tag>\s*((?:\S+\s+){0,5})"

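This works only in a very specific case: when there is a single occurrence of the tagged term (or the occurrences are spaced far apart), and there are enough words before the first occurrence.

The problems are:

1 - If the second occurrence falls within the +5 range of the first one, the second occurrence gets no -5 context, or it is swallowed by the first one. An overlap problem?

2 - If there are fewer than 5 words, or if you raise the range to 7 or 8, it simply skips the first occurrence and jumps to the second or third.

So a sentence like:

word word word match word word match word word word

will not parse well.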
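The overlap can be reproduced with a short sketch. The pattern is the one from the question; the sample string is a hypothetical stand-in for the sentence above, with each occurrence wrapped in `<tag>` and the other words numbered so they can be told apart.

```python
import re

# Pattern copied from the question.
PATTERN = r"(?i)((?:\S+\s+){0,5})<tag>(\w*)</tag>\s*((?:\S+\s+){0,5})"

# Hypothetical sample: 3 words, a tagged term, 2 words, a tagged term, 3 words.
sample = "w1 w2 w3 <tag>match</tag> w4 w5 <tag>match</tag> w6 w7 w8"

matches = re.findall(PATTERN, sample)
for before, term, after in matches:
    print(repr(before), repr(term), repr(after))

# Count the tagged occurrences independently of the main pattern.
tagged = len(re.findall(r"<tag>", sample))
print(len(matches), "match(es) for", tagged, "tagged occurrences")
```

Only one match comes back: the trailing group of the first match consumes the second `<tag>match</tag>` as if it were an ordinary word, and a regex match cannot reuse text that an earlier match has already consumed.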

Is there a way to take these issues into account and make it work?

Thanks in advance, everyone!

1 answer:

Answer 0: (score: 1)

This may be what you are after, without using regex:

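#!/usr/bin/env python3


def find_words(s, count, needle):
    # split the string into a list of words
    lst = s.split()

    # index of the first occurrence of the needle
    # (list.index raises ValueError if the needle is absent)
    idx = lst.index(needle)

    # window bounds; clamp the start at 0 so that a negative
    # index does not wrap around to the end of the list
    start = max(0, idx - count)
    end = idx + count

    # print the window as a slice of the list
    print(lst[start:end + 1])


def find_occurrences_in_list(s, count, needle):
    # split the string into a list of words
    lst = s.split()

    # indexes of every occurrence of the needle
    idx_list = [i for i, x in enumerate(lst) if x == needle]

    r = []
    for n in idx_list:
        # clamp the start at 0 for occurrences near the beginning
        start = max(0, n - count)
        end = n + count
        # join the window of words back into a single string
        r.append(" ".join(lst[start:end + 1]))

    print(r)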

# the string of words
mystring1 = "zero one two three four five match six seven eight nine ten eleven"
# call the function: 5 words ahead & behind, looking for the word "match"
find_occurrences_in_list(mystring1, 5, "match")

# call the function: 3 words ahead & behind, looking for the word "nation"
mystring2 = "Four score and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition"
find_occurrences_in_list(mystring2, 3, "nation")

mystring3 = "zero one two three four five match six seven match eight nine ten eleven"
find_occurrences_in_list(mystring3, 2, "match")


Output:

['one two three four five match six seven eight nine ten']
['continent a new nation conceived in Liberty']
['four five match six seven', 'six seven match eight nine']
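If the term is still wrapped in `<tag>...</tag>` as in the question's regex, the same word-list idea can be combined with `re.finditer`. The following is only a sketch under some assumptions: the function name `context_windows` is hypothetical, tags are assumed to be separated from their neighbours by whitespace, and neighbouring tagged terms are counted as ordinary context words.

```python
import re

# Same tag markup as in the question's pattern, matched case-insensitively.
TAG_RE = re.compile(r"<tag>(\w*)</tag>", re.IGNORECASE)


def context_windows(text, count=5):
    """Return (before_words, term, after_words) for every tagged term."""
    results = []
    for m in TAG_RE.finditer(text):
        # Remove tag markup from the text on both sides so that other
        # tagged occurrences are treated as plain words.
        before = TAG_RE.sub(r"\1", text[:m.start()]).split()
        after = TAG_RE.sub(r"\1", text[m.end():]).split()
        # before[-0:] would return the whole list, so handle count == 0.
        before_window = before[-count:] if count > 0 else []
        results.append((before_window, m.group(1), after[:count]))
    return results


sample = "w1 w2 w3 <tag>match</tag> w4 w5 <tag>match</tag> w6 w7 w8"
for before, term, after in context_windows(sample, 5):
    print(before, term, after)
```

Because each window is sliced out of the full text independently, overlapping windows and windows with fewer than `count` words are both handled without any special cases.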