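I am looking for a way to extract a specific paragraph from a string. I have many documents that I want to use for topic modeling, but they contain tables, figures, headings and so on. I only want to use the summary that the documents usually contain, but the summary is not explicitly marked.
I converted the PDFs to text and tried something like the following, but it does not work well because the summary is declared differently every time: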
    def get_summary(text):
        subject = ""
        copy = False
        textlines = text.splitlines()
        for line in textlines:
            #print line
            if line.strip() == 'SUMMARY_BEGIN':
                copy = True
            elif line.strip() == 'SUMMARY_END':
                copy = False
            elif copy:
                #print(line)
                subject += line
        return subject
I don't want to search for the summary between 100 possible substrings.
Edit: a similar example:
Date
21 Jun 2017
name name [abc]
name name [abc]
name name [cbd]
name name
name name
name name
name name
name name
nonsense-word1
nonsense-word1
nonsense-word1
12354
37264324
Summary:
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
Here is the only part I want to extract out of my document. Here is the only part I want to extract out of my document.
32 463264
324324
324432
32424
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
nonsense-word2
324
24442
name name
name name
name name
name name
3244324324
Date
21 Jun 2017
Date
21 Jun 2017
Date
21 Jun 2017
electronically validated
electronically validated
electronically validated
electronically validated
electronically validated
763254 3276 4276457234
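Answer 0 (score: 0)
You can write a regular expression that captures only sentences. This one matches the first sequence of at least two consecutive sentences, each starting with an uppercase letter and ending with a period:
    (?:[A-Z][^\n.]+\.\s*){2,}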
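As a rough illustration, the pattern can be applied with Python's `re` module. The helper name `first_sentence_run` and the sample text are made up for this sketch, and the period is escaped here so that it only matches a literal full stop.

```python
import re

# At least two consecutive "sentences": an uppercase letter, then any
# characters except newline or period, then a literal period.
SENTENCE_RUN = re.compile(r'(?:[A-Z][^\n.]+\.\s*){2,}')


def first_sentence_run(text):
    """Return the first run of two or more sentences, or '' if none is found."""
    match = SENTENCE_RUN.search(text)
    return match.group(0).strip() if match else ''


# Made-up sample that mimics the layout of the example document.
sample = (
    "Date\n21 Jun 2017\nname name [abc]\n12354\nSummary:\n"
    "Here is the first sentence. Here is the second one.\n"
    "Here is a third sentence.\n32 463264\n324324\n"
)
print(first_sentence_run(sample))
```

Note that a period inside a sentence (a decimal number or an abbreviation, for instance) ends the run early, so this probably works best on summaries written in plain prose.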
Answer 1 (score: 0)
Why not look for sentences in the document that have more than N words? Those are probably real sentences rather than useless lines.
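A minimal sketch of the word-count idea, assuming that each line of the converted text can stand in for a "sentence". The function name `long_lines` and the threshold of 8 words are arbitrary choices for illustration, not something taken from the question.

```python
def long_lines(text, min_words=8):
    """Keep only the lines with at least `min_words` whitespace-separated words."""
    kept = [line.strip() for line in text.splitlines()
            if len(line.split()) >= min_words]
    return "\n".join(kept)


# Made-up sample that mimics the layout of the example document.
sample = (
    "Date\n21 Jun 2017\nname name [abc]\n12354\nSummary:\n"
    "Here is the only part I want to extract out of my document.\n"
    "32 463264\nnonsense-word2\nelectronically validated\n"
)
print(long_lines(sample))
```

The threshold would need tuning per corpus: too low and table rows slip through, too high and short summary lines are dropped.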
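Another approach is to rely on words that only occur in real sentences. Some simple words are likely to appear only in genuine paragraphs. For example, you could use a simple grep for articles or prepositions.
Answer 2 (score: 0)
Just let re do the heavy lifting ;)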
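    import re

    def get_summary(text):
        # Capture everything between the "Summary:" line and the first run of
        # six or more digit/whitespace characters that follows it.
        match = re.search(
            r'\nSummary:\n(?P<content>.*?)[\d\s]{6,}',
            text,
            flags=re.MULTILINE | re.DOTALL,
        )
        # re.search returns None when nothing matches, so guard before .group()
        return match.group('content') if match else None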
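Since the question says the summary heading varies between documents, the same idea could be loosened to accept several heading spellings. The following is only a sketch: the function name `get_summary_any_heading` and the entries in `HEADINGS` are hypothetical and would have to be replaced with whatever headings actually occur in the corpus.

```python
import re

# Hypothetical heading variants; extend this tuple to match your own documents.
HEADINGS = ("Summary", "Abstract", "Executive Summary")


def get_summary_any_heading(text, headings=HEADINGS):
    """Return the text after the first matching heading line, or None."""
    alternatives = "|".join(re.escape(h) for h in headings)
    pattern = re.compile(
        r'^[ \t]*(?:' + alternatives + r')[ \t]*:?[ \t]*\n'  # heading on its own line
        r'(?P<content>.*?)'                                    # summary body, non-greedy
        r'(?:[\d\s]{6,}|\Z)',                                  # stop at a numeric block or end of text
        flags=re.MULTILINE | re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(text)
    return match.group('content').strip() if match else None


# Made-up sample with an upper-case heading and no colon.
sample = (
    "Date\n21 Jun 2017\nSUMMARY\n"
    "First line of text.\nSecond line.\n"
    "32 463264\n324324\nname name\n"
)
print(get_summary_any_heading(sample))
```

The `[\d\s]{6,}` terminator is carried over from the answer above. It can cut a summary short if the text itself contains a long number, or a year at the end of a line (" 2017" plus the newline is already six characters), so it may need adjusting.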