Scraping relevant text passages from a local text file

Date: 2016-11-20 14:09:28

Tags: python file text screen-scraping local

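I am looking for a piece of Python code that can scrape the relevant parts of a text. Suppose I have a set of words: when one of them is encountered, the code should scrape the sentence in which the word was found, plus 1 or 2 sentences before and after it. It should then print that text so it can be copied.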

For example, see the text below. Let's say the relevant word is "simple". It detects "simple" in sentence 3, so it scrapes sentences 2, 3 and 4.

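Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Readability counts.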

becomes ->

'Explicit is better than implicit. Simple is better than complex. Complex is better than complicated.'

I believe the idea behind the code is simple, but I don't know how to implement it.

import re

caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

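relevantwords = ["refugees", "conflicts", "mobility", "rights", "presence", "freedom", "immigrants", "politics", "political"]

for i in range(20):
    # Read the whole file as a single string; str(file.readlines()) would
    # embed list brackets and escaped newlines in the text.
    with open("text" + str(i) + ".txt", "r") as f:
        data = f.read()

    sentences = split_into_sentences(data)
    for n, sentence in enumerate(sentences):
        # Test each word in the list, not the literal string "relevantwords".
        if any(word in sentence.lower() for word in relevantwords):
            # Print the previous, matching and next sentence; max() keeps
            # the slice from wrapping around at the start of the list.
            print(" ".join(sentences[max(n - 1, 0):n + 2]))
            print("\n")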

1 answer:

Answer 0 (score: 0)

I'll briefly outline the code. If you can't get it working, feel free to post your code and we'll be happy to help you fix it!

You want to:

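  1. Read the file into a string in your program.
  2. Split the string into sentences by splitting on the character '.'. Note that if you have abbreviations such as "Mr.", the split will treat them as the end of a sentence.
  3. Now iterate over the list of sentences, and on each iteration:
    • Check whether the word is in sentence i. If it is, print sentences i-1, i, and i+1.
    • Alternatively, if you don't want to print them, append them to a list that you created at the start.
  4. If you have any specific questions about how to implement this, let us know!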
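
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `context_for_matches` and the `window` parameter are made up for this example, matching is a plain case-insensitive substring test, and the naive split on '.' still breaks on abbreviations such as "Mr.", as noted in step 2. It collects the passages into a list (the alternative in step 3) so you can print them or reuse them.

```python
def context_for_matches(text, words, window=1):
    """Return each sentence containing any of `words`, together with
    `window` sentences on either side (hypothetical helper)."""
    # Step 2: naive split on '.'; abbreviations like "Mr." will be
    # treated as sentence ends.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    lowered = [w.lower() for w in words]
    results = []
    # Step 3: check every sentence and collect its neighbours on a match.
    for i, sentence in enumerate(sentences):
        if any(w in sentence.lower() for w in lowered):
            start = max(i - window, 0)  # clamp so a negative index does not wrap around
            results.append(" ".join(sentences[start:i + window + 1]))
    return results


if __name__ == "__main__":
    # Step 1 would normally read a file instead, for example:
    #     with open("text0.txt") as f:
    #         text = f.read()
    sample = ("Beautiful is better than ugly. Explicit is better than implicit. "
              "Simple is better than complex. Complex is better than complicated. "
              "Readability counts.")
    for passage in context_for_matches(sample, ["simple"]):
        print(passage)
```

With the sample text from the question and the word "simple", this collects the second, third and fourth sentences as one passage. Passing `window=2` would widen the context to two sentences on each side.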