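I'm looking for Python code that can grab the relevant portions of a text. Suppose I have a set of words: when the code encounters one of them, it should grab the sentence in which the word was found plus 1 or 2 sentences before and after it. It should then print that text so it can be copied.
For example, see the text below. Let's say the relevant word is "simple". It detects "simple" in sentence 3, so it grabs sentences 2, 3 and 4.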
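Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Readability counts.
becomes ->
'Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.'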
I believe the idea behind the code is simple, but I don't know how to implement it.
import re

# Regex building blocks for the sentence splitter.
# Raw strings keep escapes such as \s intact for the regex engine.
caps = r"([A-Z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"
def split_into_sentences(text):
    # Protect periods that do not end a sentence by marking them <prd>.
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(caps + "[.]" + caps + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + caps + "[.]", " \\1<prd>", text)
    # Move closing quotes in front of the terminator so the split lands after the quote.
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    # Mark the real sentence ends, then restore the protected periods.
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]  # drop whatever follows the last terminator
    sentences = [s.strip() for s in sentences]
    return sentences
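relevantwords = ["refugees", "conflicts", "mobility", "rights", "presence", "freedom", "immigrants", "politics", "political"]

for i in range(20):
    # Read the file as one string; str(file.readlines()) would give the repr of a list.
    with open("text" + str(i) + ".txt", "r", encoding="utf-8") as file:
        data = file.read()
    sentences = split_into_sentences(data)
    for n, line in enumerate(sentences):
        # Test each relevant word, not the literal string "relevantwords".
        if any(word in line for word in relevantwords):
            # Print the previous, matching and next sentence (clamped at the start).
            print(" ".join(sentences[max(n - 1, 0):n + 2]))
            print()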
Answer 0 (score: 0)
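I'll give a brief outline of the code. If you have trouble implementing it, feel free to post your code and we'll be happy to help you troubleshoot!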
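You want to:
1. Split the string into sentences on '.'. Note that if you have abbreviations such as "Mr.", this will treat them as the end of a sentence.
2. Loop over the sentences and check whether a relevant word is in sentence i. If it is, print sentences i-1, i and i+1.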
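The outline above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the names `split_sentences`, `grab_context` and `window` are made up for this example, the splitter is deliberately naive (it breaks after '.', '!' or '?' followed by whitespace, so abbreviations such as "Mr." will still be split), and matching is a case-insensitive substring test.

```python
import re


def split_sentences(text):
    """Naive splitter (hypothetical helper): break after . ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def grab_context(text, relevant_words, window=1):
    """Return each matching sentence joined with `window` sentences on either side."""
    sentences = split_sentences(text)
    passages = []
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        # Substring test, so "simple" also matches "simpler".
        if any(word.lower() in lowered for word in relevant_words):
            start = max(i - window, 0)                  # clamp at the first sentence
            end = min(i + window + 1, len(sentences))   # clamp at the last sentence
            passages.append(" ".join(sentences[start:end]))
    return passages


text = ("Beautiful is better than ugly. Explicit is better than implicit. "
        "Simple is better than complex. Complex is better than complicated. "
        "Readability counts.")

for passage in grab_context(text, ["simple"]):
    print(passage)
```

With the sample text from the question this prints the second, third and fourth sentences. Passing `window=2` would grab two sentences on each side, and the question's own `split_into_sentences` could be swapped in for `split_sentences` if abbreviation handling matters.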
If you have any specific questions about how to implement this, let us know!