Question

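I'm trying to write code that processes text and eventually indexes all of it. First I need to remove non-alphabetic characters and punctuation and convert uppercase letters to lowercase, then remove stopwords.

This is what I have so far: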

from stopwords import *

def removeStopwords(wordlist, flag):
    return [w for w in wordlist if w not in flag]

def preprocessing():
    import re
    with open('44.txt', 'r', encoding = 'utf8') as data:
        for line in data:
            a = line.rstrip().lower()
            result = re.sub('[^a-zA-Z]', ' ', a)
            b = removeStopwords(result, stopwords)
            print(b)

if __name__ == '__main__':
    preprocessing()

But what I get back is every word split into single letters: ['a'], ['w'], ['o'], ['l'], ['f']

stopwords.py is just a list of words:

stopwords = ['a', 'are', 'aren t', ....]

Can anyone tell me what's going on?

Thanks for your time!

Answer 1

wordlist is just a string. When you do

w for w in wordlist if w not in flag

it iterates over each character of the string, so you get individual letters. Convert the string to a list of words before passing it to removeStopwords.

def preprocessing():
    import re
    with open('44.txt', 'r', encoding = 'utf8') as data:
        for line in data:
            a = line.rstrip().lower()
            result = re.sub('[^a-zA-Z]', ' ', a)
            result = result.split()  # creates a list of words
            b = removeStopwords(result, stopwords)
            print(b)
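
To make the difference concrete, here is a minimal sketch that runs the same filter on the string and on the split list. The sample line and the three-entry stop list are made up for illustration; the real list lives in stopwords.py.

```python
import re

# Hypothetical stop list for illustration only.
stopwords = ['a', 'are', 'the']

def removeStopwords(wordlist, flag):
    return [w for w in wordlist if w not in flag]

line = "The wolf"
cleaned = re.sub('[^a-zA-Z]', ' ', line.rstrip().lower())

# Passing the string: the comprehension walks it character by character.
by_char = removeStopwords(cleaned, stopwords)

# Passing a list of words: the comprehension walks it word by word.
by_word = removeStopwords(cleaned.split(), stopwords)

print(by_char)  # ['t', 'h', 'e', ' ', 'w', 'o', 'l', 'f']
print(by_word)  # ['wolf']
```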

Answer 2

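As jedward's answer explains, your first problem is that, despite the misleading name wordlist, what you pass to removeStopwords is not a list of words but a string, that is, a sequence of individual characters.

If your stop list really consists entirely of single words, the solution is simple: split the string into words, then remove the words that match the stop list.

Unfortunately, if your stop list contains something like aren t, that won't work. "These examples aren't good" gets preprocessed to "these examples aren t good", which splits into ["these", "examples", "aren", "t", "good"], and obviously none of those words matches "aren t".

The ideal solution would be to remove intra-word punctuation rather than converting it to spaces. Like this: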
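
A short sketch of that failure, assuming a three-entry stop list shaped like the one in the question (the entries here are illustrative, not the full list):

```python
import re

# Hypothetical stop list with one multi-word entry, as in the question.
stopwords = ['a', 'are', 'aren t']

text = "These examples aren't good"
cleaned = re.sub('[^a-zA-Z]', ' ', text.lower())
words = cleaned.split()

# Word-by-word filtering only ever compares single tokens to the stop list,
# so the two-token entry 'aren t' can never match.
filtered = [w for w in words if w not in stopwords]

print(cleaned)   # these examples aren t good
print(filtered)  # ['these', 'examples', 'aren', 't', 'good']
```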

result = re.sub('[^a-zA-Z]', ' ', re.sub("['_]", '', a))

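Then you end up with "these examples arent good", and (assuming you write the stopword as "arent" rather than "aren t") the simple solution still works. But that may not fit your requirements, since it changes the rules.

So, let's say you can't do that. Then, if you want to keep it simple, you need to actually filter out subsequences of words, not just single words.

So, something like this: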
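
Put together, that variant might look like the sketch below. The clean helper and the stop list are hypothetical names introduced here, and the list assumes the stopword is stored as "arent".

```python
import re

# Hypothetical stop list; note 'arent' is written without a space.
stopwords = ['a', 'are', 'arent']

def clean(line):
    # Delete apostrophes and underscores first so contractions stay one token,
    # then turn every remaining non-letter into a space.
    joined = re.sub("['_]", '', line.lower())
    return re.sub('[^a-zA-Z]', ' ', joined)

words = clean("These examples aren't good").split()
filtered = [w for w in words if w not in stopwords]

print(words)     # ['these', 'examples', 'arent', 'good']
print(filtered)  # ['these', 'examples', 'good']
```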

def removeStopwords(line, stopwords):
    result = []
    wordlist = line.split()
    i = 0
    while i < len(wordlist):
        for stopword in stopwords:
            stopwordlist = stopword.split()
            if wordlist[i:i+len(stopwordlist)] == stopwordlist:
                i += len(stopwordlist)
                break
        else:
            result.append(wordlist[i])
            i += 1
    return ' '.join(result)

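If you need it to be faster, you'll need to preprocess stopwords into a better data structure, such as a trie, that can quickly scan for matching prefixes.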
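
As a rough sketch of the trie idea, the code below stores each stop phrase as a path of words in nested dicts and scans the line once. The names build_trie, remove_stopwords_trie and _END are invented for this example. One behavioural difference is assumed to be acceptable: at each position it removes the longest matching stop phrase, rather than the first one that matches in list order.

```python
_END = object()  # sentinel key marking the end of a complete stop phrase

def build_trie(stopwords):
    """Build a word-level trie (nested dicts) from single- or multi-word stop phrases."""
    root = {}
    for phrase in stopwords:
        tokens = phrase.split()
        if not tokens:
            continue  # ignore empty entries
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[_END] = True
    return root

def remove_stopwords_trie(line, trie):
    """Drop every stop phrase from line, preferring the longest match at each position."""
    words = line.split()
    result = []
    i = 0
    while i < len(words):
        node = trie
        match_len = 0
        j = i
        # Walk the trie as far as the upcoming words allow,
        # remembering the longest complete phrase seen so far.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                match_len = j - i
        if match_len:
            i += match_len          # skip the whole stop phrase
        else:
            result.append(words[i])  # keep the word
            i += 1
    return ' '.join(result)

trie = build_trie(['a', 'are', 'aren t'])
print(remove_stopwords_trie("these examples aren t good", trie))  # these examples good
```

Each word is looked up in a dict once per step, so the cost per position depends on the length of the longest stop phrase rather than on the size of the stop list.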

How to use stopwords when preprocessing a txt file
