Question

I have a list of English and foreign-language movie titles compiled in a text file, with each title on its own line:

Kein Pardon
Kein Platz für Gerold
Kein Sex ist auch keine Lösung
Keine Angst Liebling, ich pass schon auf
Keiner hat das Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
La Prima Donna
La Primeriza
La Prison De Saint-Clothaire
La Puppe
La Pájara
La Pérgola de las Flores

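I have compiled a short list of common non-English stop words that I want to filter out of the text file, e.g. La, de, las, das. How can I read in my text, filter out the words, and then write the filtered list to a new text file in the original format? The desired output should look roughly like this: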

Kein Pardon
Kein Platz für Gerold
Kein Sex keine Lösung
Keine Angst Liebling, pass schon
Keiner hat Pferd geküsst
Keiner liebt mich
Keinohrhasen
Keiro's Cat
Prima Donna
Primeriza
Prison Saint-Clothaire
Puppe
Pájara
Pérgola Flores

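To clarify, I know there is a way to do this with the NLTK library, which has a more comprehensive stop-word list, but I am looking for an alternative that targets only a few selected words from my own list.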

Answer 1

You can use the re module (https://docs.python.org/2/library/re.html#re.sub) to replace the unwanted strings with an empty string. Something like this should work:

    import re

    # Save your undesired text here. You can use a different data structure
    # if the list is big and build the match string from it later.
    # \b restricts each match to a whole word.
    unDesiredText = r'\b(?:abc|bcd|vas)\b'

    # Set inputFile and outputFile appropriately (placeholder paths shown)
    inputFile = 'in.txt'
    outputFile = 'out.txt'

    fhIn = open(inputFile, 'r')
    fhOut = open(outputFile, 'w')

    for line in fhIn:
        line = re.sub(unDesiredText, '', line)
        fhOut.write(line)

    fhIn.close()
    fhOut.close()  # the parentheses are needed to actually close the file
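
If the stop words are kept in a list instead of a hand-written string, the match string mentioned in the comment above can be built from that list. This is only a sketch: `build_pattern` is a hypothetical helper name, and the word list is taken from the examples in the question.

```python
import re

def build_pattern(stop_words):
    """Compile a case-insensitive regex that matches any stop word as a whole word."""
    # re.escape guards against words containing regex metacharacters.
    escaped = [re.escape(w) for w in stop_words]
    # \b on both sides keeps 'la' from matching inside 'Platz'.
    return re.compile(r'\b(?:' + '|'.join(escaped) + r')\b', re.IGNORECASE)

# Assumed stop-word list, based on the words named in the question.
pattern = build_pattern(['la', 'de', 'las', 'das'])

line = 'La Prison De Saint-Clothaire\n'
cleaned = pattern.sub('', line)
```

Note that the removed words leave their surrounding spaces behind, so `cleaned` still needs its whitespace tidied before it matches the desired output.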

Answer 2

Another approach, if you are interested in exception handling and other related details:

import re

stop_words = ['de', 'la', 'el']
# Escape each word and wrap the alternation in \b so that only whole words
# match (otherwise 'la' would also be stripped out of 'Platz').
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in stop_words) + r')\b'
prog = re.compile(pattern, re.IGNORECASE)  # re.IGNORECASE to catch both 'La' and 'la'

input_file_location = 'in.txt'
output_file_location = 'out.txt'

with open(input_file_location, 'r') as fin:
    with open(output_file_location, 'w') as fout:
        for l in fin:
            m = prog.sub('', l.strip())  # l.strip() to remove leading/trailing whitespace
            m = re.sub(' +', ' ', m)  # collapse the runs of spaces left behind
            fout.write('%s\n' % m.strip())
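
Since the titles contain accented characters, it may be worth stating the file encoding explicitly. The following is a sketch that assumes Python 3 and UTF-8 files; `filter_file` is a hypothetical name, and the `encoding` argument should be changed if the file was saved in another encoding.

```python
import re

def filter_file(src, dst, stop_words, encoding='utf-8'):
    """Copy src to dst line by line, dropping whole-word stop words."""
    words = '|'.join(re.escape(w) for w in stop_words)
    prog = re.compile(r'\b(?:' + words + r')\b', re.IGNORECASE)
    with open(src, 'r', encoding=encoding) as fin, \
         open(dst, 'w', encoding=encoding) as fout:
        for line in fin:
            kept = prog.sub('', line)
            # split() followed by join collapses the gaps left by removed words
            # and also drops leading/trailing whitespace.
            fout.write(' '.join(kept.split()) + '\n')
```

A call such as `filter_file('in.txt', 'out.txt', ['la', 'de', 'las'])` would then write one filtered title per line.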

Answer 3

Read in the file:

with open('file', 'r') as f:
    inText = f.read()

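Write a function that takes a string you do not want in the text. You can process the whole text at once rather than line by line. Since you also want the text to stay available between calls, I would put it in a class: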

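class changeText(object):
    def __init__(self, text):
        self.text = text

    def erase(self, badText):
        # str.replace returns a new string, so assign the result back.
        # Note: this removes every occurrence, including inside longer words.
        self.text = self.text.replace(badText, '')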

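However, when you replace a word with an empty string, you are left with double spaces, as well as \n followed by a space, so add a method to clean up the resulting text: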

    def cleanup(self):
        # Assign the results back; str.replace does not modify in place
        self.text = self.text.replace('  ', ' ')
        self.text = self.text.replace('\n ', '\n')

Initialize the object:

textObj = changeText( inText )

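Then loop over the list of bad words and clean up:

# badWords is your own stop-word list; these entries come from the question
badWords = ['La', 'de', 'las', 'das']

for bw in badWords:
    textObj.erase(bw)
textObj.cleanup()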

Finally, write it out:

with open('newfile', 'w') as f:  # 'w', since the file is being written
    f.write(textObj.text)
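
The double-space cleanup can be avoided altogether by working on tokens instead of substrings. This is a different technique from the `replace` calls above, shown only as a sketch: `remove_stop_words` is a hypothetical name, the word list comes from the question, and punctuation attached to a token is ignored when comparing (so a stop word followed by a comma would be dropped together with the comma).

```python
import string

def remove_stop_words(text, stop_words):
    """Return text with stop-word tokens dropped, keeping one title per line."""
    stops = {w.lower() for w in stop_words}
    out_lines = []
    for line in text.splitlines():
        # Compare each token case-insensitively, ignoring surrounding punctuation.
        kept = [tok for tok in line.split()
                if tok.strip(string.punctuation).lower() not in stops]
        out_lines.append(' '.join(kept))
    return '\n'.join(out_lines)

sample = "Keiner hat das Pferd geküsst\nLa Prima Donna\nKeiro's Cat"
result = remove_stop_words(sample, ['La', 'de', 'las', 'das'])
```

Because the tokens are rejoined with single spaces, only whole words are removed and no leftover whitespace needs fixing afterwards. The returned string has no trailing newline, so add one when writing it to a file if that matters.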

Filtering foreign stop words in a text file

3 answers: