How do I search for a specific paragraph in a text?

Time: 2018-06-27 12:18:42

Tags: python string nlp topic-modeling

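I am looking for a way to extract a specific paragraph from a string. I have many documents that I want to use for topic modeling, but they contain tables, figures, headings, and so on. I only want to use the summary that the documents usually contain. However, the summary is not explicitly marked.

I converted the PDFs to text and tried something like the following, but it does not work well because the summary is introduced differently in every document: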

def get_summary(text):
    """Return the lines between the SUMMARY_BEGIN and SUMMARY_END markers."""
    summary_lines = []
    copy = False

    for line in text.splitlines():
        if line.strip() == 'SUMMARY_BEGIN':
            copy = True
        elif line.strip() == 'SUMMARY_END':
            copy = False
        elif copy:
            summary_lines.append(line)

    # splitlines() drops the line breaks, so restore them when joining.
    return '\n'.join(summary_lines)

I do not want to search for the summary between 100 possible substrings.

Edit: a similar example:

Date
21 Jun 2017

name name [abc]
name name [abc]
name name [cbd]
name name
name name
name name
name name
name name

nonsense-word1

nonsense-word1
nonsense-word1

12354
37264324

Summary:
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 

Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 

Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. 


32 463264 
324324
324432
32424

nonsense-word2

nonsense-word2
nonsense-word2
nonsense-word2

nonsense-word2

nonsense-word2

324
24442

name name
name name
name name
name name

3244324324

Date
21 Jun 2017

Date
21 Jun 2017

Date
21 Jun 2017

electronically validated

electronically validated

electronically validated

electronically validated
electronically validated


763254 3276 4276457234

3 Answers:

Answer 0 (score: 0)

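You can write a regular expression that captures only sentences. The one below matches the first run of at least two consecutive sentences, each starting with a capital letter.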

(?:[A-Z][^\n.]+.\s*){2,}

https://regex101.com/r/blK6sf/1
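
As a rough sketch of how the pattern above might be applied from Python: the helper name `first_sentence_run` and the sample text are made up for illustration, and the sentence-ending period is escaped here (`\.`) so that it only matches a literal full stop.

```python
import re

# Same idea as the pattern above; the final period is escaped (an adjustment
# made for this sketch) so it only matches a literal ".".
SENTENCE_RUN = re.compile(r'(?:[A-Z][^\n.]+\.\s*){2,}')

def first_sentence_run(text):
    """Return the first run of two or more sentences, or None if there is none."""
    match = SENTENCE_RUN.search(text)
    return match.group(0).strip() if match else None

# Hypothetical sample, loosely modelled on the example in the question.
sample = (
    "Date\n"
    "21 Jun 2017\n"
    "\n"
    "name name [abc]\n"
    "\n"
    "Summary:\n"
    "Here is the first sentence. Here is the second one.\n"
    "Here is a third sentence.\n"
    "\n"
    "32 463264\n"
)

print(first_sentence_run(sample))
```

Note that a pattern like this treats every period as the end of a sentence, so abbreviations or decimal numbers inside the summary would probably cut the match short.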

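Answer 1 (score: 0)

Why not look for sentences in the document that have more than N words? Those are likely to be real sentences rather than useless lines.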
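
A minimal sketch of this idea, assuming that each line of the converted text can stand in for a sentence. The function name `lines_longer_than` and the default threshold of 7 words are arbitrary choices for illustration, not values from the answer.

```python
def lines_longer_than(text, n_words=7):
    """Keep only the lines that contain more than `n_words` whitespace-separated words."""
    return [
        line.strip()
        for line in text.splitlines()
        if len(line.split()) > n_words
    ]

# Hypothetical sample, loosely modelled on the example in the question.
sample = (
    "Date\n"
    "21 Jun 2017\n"
    "name name [abc]\n"
    "Summary:\n"
    "Here is the only part I want to extract out of my document.\n"
    "32 463264\n"
)

print(lines_longer_than(sample))
```

The threshold would need tuning per corpus: too low and table rows slip through, too high and short summary lines are dropped.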

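Another approach is to identify words that occur only in real sentences. Some simple words probably appear only in real paragraphs. For example, you could use a simple grep for articles or prepositions.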
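
The grep idea could look roughly like this in Python. The word list `FUNCTION_WORDS`, the function name, and the `min_hits` threshold are all assumptions made for this sketch; a real corpus would likely need a longer list.

```python
import re

# A small hand-picked set of English articles and prepositions (an assumption;
# extend or replace it for your own documents).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "for", "with", "by", "at", "from",
}

def lines_with_function_words(text, min_hits=2):
    """Keep lines that contain at least `min_hits` words from FUNCTION_WORDS."""
    kept = []
    for line in text.splitlines():
        # Lowercase the line and pull out alphabetic tokens only.
        words = re.findall(r"[a-z]+", line.lower())
        hits = sum(1 for word in words if word in FUNCTION_WORDS)
        if hits >= min_hits:
            kept.append(line.strip())
    return kept

# Hypothetical sample, loosely modelled on the example in the question.
sample = (
    "Date\n"
    "21 Jun 2017\n"
    "name name [abc]\n"
    "nonsense-word1\n"
    "Here is the only part I want to extract out of my document.\n"
    "electronically validated\n"
)

print(lines_with_function_words(sample))
```

Requiring two or more hits rather than one is a guess at what keeps stray table cells out; it may need adjusting.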

Answer 2 (score: 0)

Just let re do the heavy lifting ;)

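import re

def get_summary(text):
    # Capture everything after a "Summary:" line, stopping at the first run of
    # six or more digits/whitespace characters (the numeric block that follows).
    match = re.search(
        r'\nSummary:\n(?P<content>.*?)[\d\s]{6,}',
        text,
        flags=re.DOTALL,
    )
    # re.search returns None when nothing matches, so guard before .group().
    return match.group('content') if match else None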