Extract 25 words on either side of a word from text

Time: 2019-02-09 17:25:05

Tags: python regex python-3.x regex-lookarounds

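I have the following text, and I am trying to use this pattern to extract 25 words on each side of a match. The challenge is that the matches overlap, so the Python regex engine finds only one match. I would appreciate it if anyone could help fix this.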

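Text:

2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average

I tried the following pattern: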

pattern = r'(?<=outlook\s)((\w+.*?){25})'

This produces one match, whereas I need two matches, and it does not matter whether one match overlaps the other.

I basically need two matches.
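To illustrate the behaviour described in the question, here is a minimal sketch under stated assumptions: the sample text is a shortened, made-up variant of the passage, the window is 3 words instead of 25, and the variable names are hypothetical. It shows that a consuming pattern skips a second "outlook" that falls inside the first match, while wrapping the word run in a lookahead lets the captured windows overlap. It only covers the words after the keyword; the `re` module does not support variable-width lookbehind, so the words before it would need a different approach.

```python
import re

# Shortened, hypothetical sample (not the full passage from the question).
text = "2015 outlook the following outlook for 2015 in lieu of guidance"
n = 3  # words to capture after each "outlook"; the question uses 25

# Consuming pattern: the n following words become part of the match,
# so an "outlook" inside that span is never seen as a new match start.
consuming = re.compile(r'\boutlook\b((?:\s+\S+){%d})' % n)

# Lookahead pattern: only "outlook" itself is consumed; the n following
# words are captured inside a zero-width lookahead and may overlap.
lookahead = re.compile(r'\boutlook\b(?=((?:\s+\S+){%d}))' % n)

consumed_matches = [m.strip() for m in consuming.findall(text)]
overlapping_matches = [m.strip() for m in lookahead.findall(text)]

print(consumed_matches)     # ['the following outlook']
print(overlapping_matches)  # ['the following outlook', 'for 2015 in']
```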

2 Answers:

Answer 0: (score: 1)

I would not use regex at all: the Python re module does not return overlapping matches.

text = """2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"""

lookfor = "outlook"

# split text at spaces
splitted = text.lower().split()

# get the position in splitted where the words match (remove .,-?! for comparison) 
positions = [i for i,w in enumerate(splitted) if lookfor == w.strip(".,-?!")]


# printing here, you can put those slices in a list for later usage
for p in positions:    # positions is: [1, 8, 21]
    # up to 25 words before p, the word itself, and up to 25 words after it
    print( ' '.join(splitted[max(0,p-25):p+26]) )
    print()

Output:

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs.

2015 outlook the company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. this outlook does not include the impact of any future acquisitions and transaction-related costs. revenues - based on the revenues from the fourth quarter of 2014, the

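Iterating over the split words gives you the positions, and you then slice the split list around each one. Make sure the slice starts at 0 whenever the computed start index would be negative, otherwise you will not get any output (a start of -4, for example, counts from the end of the list).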
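The same idea can be collected into a list instead of printed, as the code comment above suggests. The following is only a sketch: the function name `context_windows`, its parameters, and the short sample sentence are hypothetical and not part of the answer. It also shows why the start index is clamped with `max(0, ...)`.

```python
def context_windows(text, keyword, n=25, strip_chars=".,-?!"):
    """Return one list of words per occurrence of keyword,
    with up to n words on each side of it."""
    keyword = keyword.lower()
    words = text.lower().split()
    windows = []
    for i, w in enumerate(words):
        # strip surrounding punctuation only for the comparison
        if w.strip(strip_chars) == keyword:
            start = max(0, i - n)  # clamp: a negative start would wrap around
            windows.append(words[start:i + n + 1])
    return windows


# Hypothetical short sample with a window of 2 words per side.
sample = "a outlook b c outlook d e f"
windows = context_windows(sample, "outlook", n=2)
for w in windows:
    print(" ".join(w))

# Why the clamp matters: a negative start counts from the end of the list,
# so the slice below starts at the last element and comes back empty.
sample_words = sample.split()
unclamped = sample_words[1 - 2:1 + 3]
clamped = sample_words[max(0, 1 - 2):1 + 3]
print(unclamped)
print(clamped)
```

Returning lists of words rather than joined strings keeps the windows easy to post-process; join them with `" ".join(...)` when a string is needed.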

Answer 1: (score: 0)

A non-regex way:

string = "2015 Outlook The Company is providing the following outlook for 2015 in lieu of formal financial guidance at this time. This outlook does not include the impact of any future acquisitions and transaction-related costs. Revenues - Based on the revenues from the fourth quarter of 2014, the addition of new items at our some facility and the previously opened acquisition of Important Place, the Company expects utilization of the current 100 items to remain in some average"
words = string.split()
starting25 = " ".join(words[:25])
ending25 = " ".join(words[-25:])
print(starting25)
print("\n")
print(ending25)