Narrowing down paragraphs between headings by a specific set of words

Time: 2017-09-18 18:20:20

Tags: python grep information-extraction

I have a text file containing the following data:

History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms

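Now I want to extract the paragraphs or sections that contain a specific set of words, for example {"software", "open source"}.

I have tried regular expressions and an if loop, but could not get the required output. Can anyone help me?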

2 Answers:

Answer 0 (score: 1)

Using a regular expression:

import re

my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""

# Match any single-line paragraph that mentions "software", "open source" or "opensource"
pattern = r'\n.+(?:software|open\s?source).+\n'
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)

You will end up with every paragraph that mentions your keywords in the list paragraph_list (each match keeps its surrounding newline characters, and the match is case-sensitive).

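Edit

If you want the keywords to be dynamic, or supplied as a list or tuple, build the alternation part of the pattern from that collection instead of hard-coding it.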
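A minimal sketch of that idea, not the answer's original code. It assumes the keywords arrive as a list, and the names `find_paragraphs`, `keywords` and `sample` are made up for illustration. `re.escape` keeps any regex metacharacters inside a keyword literal, and `re.IGNORECASE` is an added choice so that "Open source" at the start of a sentence still matches.

```python
import re

def find_paragraphs(text, keywords):
    """Return every line (paragraph) of text that mentions any keyword."""
    if not keywords:
        # An empty alternation would match every line, so bail out early.
        return []
    # Escape each keyword so characters such as '.' or '+' are matched literally.
    alternation = "|".join(re.escape(k) for k in keywords)
    # Under re.MULTILINE, ^ and $ anchor at line boundaries; '.' never crosses a newline.
    pattern = re.compile(r"^.*(?:" + alternation + r").*$",
                         re.MULTILINE | re.IGNORECASE)
    return pattern.findall(text)

# Shortened stand-in for the file contents in the question.
sample = (
    "History\n\n"
    "The term data science has existed for over thirty years.\n\n"
    "Application\n\n"
    "Open source software started supplanting proprietary software.\n"
)
paragraphs = find_paragraphs(sample, ["software", "open source"])
print(paragraphs)
```

Because the pattern anchors on line boundaries rather than on literal `\n` characters, the returned paragraphs carry no leading or trailing newline.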

Answer 1 (score: 0)

You can easily check whether a substring is part of a larger string:

>>> text = 'In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms'
>>> "software" in text
True

You can extract the lines of a file that contain a particular word:

>>> with open('yourfile.txt', 'r') as f:
...     data = f.readlines()  # read every line of the file into a list
...
>>> result = [line for line in data if 'software' in line]
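The line filter keeps matching lines but loses the heading each one sits under. The sketch below is one way to keep that link. It rests on an assumption the question does not state: that a heading is a short block (at most three words here) separated from the body by blank lines. The names `sections_with_keywords`, `max_heading_words` and `sample` are hypothetical.

```python
import re

def sections_with_keywords(text, keywords, max_heading_words=3):
    """Map each heading to the paragraphs under it that mention any keyword."""
    lowered = [k.lower() for k in keywords]
    matches = {}
    heading = None
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        # Assumption: a short block is a heading, anything longer is a paragraph.
        if len(block.split()) <= max_heading_words:
            heading = block
        elif any(k in block.lower() for k in lowered):
            # Case-insensitive substring check against every keyword.
            matches.setdefault(heading, []).append(block)
    return matches

# Shortened stand-in for the file contents in the question.
sample = (
    "History\n\n"
    "The term data science has existed for over thirty years.\n\n"
    "Application \n\n"
    "Open source software started supplanting proprietary software.\n"
)
result = sections_with_keywords(sample, ["software", "open source"])
print(result)
```

For a real file, pass the result of `f.read()` as `text`. The word-count heuristic is crude, so adjust `max_heading_words` or swap in another heading test if your headings are longer.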