Question

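In Python I'm trying to create a list (myClassifier) that appends a classification ('bad'/'good') for each text file (txtEntry) stored in a list (txtList), based on whether or not it contains a bad word (badWord) stored in a list of bad words.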

txtList = ['mywords.txt', 'apple.txt', 'banana.txt', ... , 'something.txt']
badWord = ['pie', 'vegetable', 'fatigue', ... , 'something']

txtEntry is just a placeholder; really I just want to iterate over every entry in txtList.

I came up with the following code in response:

for txtEntry in txtList:
    if badWord in txtEntry:
        myClassifier += 'bad'
    else:
        myClassifier += 'good'

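However, I get TypeError: 'in <string>' requires string as left operand, not list.

I'm guessing that badWord needs to be a string rather than a list, but I'm not sure how to make this work otherwise.

How could I accomplish this?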

Answer 1

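This

if badWord in txtEntry:

tests whether badWord is equal to any substring of txtEntry. Since badWord is a list, it isn't and can't be. What you need to do instead is check each string in badWord separately, and the simplest way to do that is with the built-in function any. You do need to normalize txtEntry, though, because (as mentioned in the comments) you care about matching exact words, not just substrings (which is what string in string tests for), and you (probably) want the search to be case-insensitive: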

import re

myClassifier = []
for txtEntry in txtList:
    # Split into lowercase words so that `word in contents` doesn't give
    # false positives for substrings - avoid e.g. 'ass' in 'class'
    contents = [w.lower() for w in re.split(r'\W+', txtEntry)]

    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

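Note that, like the other answers, I use the list.append method instead of += to add the string to the list. If you use +=, your list ends up looking like ['g', 'o', 'o', 'd', 'b', 'a', 'd'] instead of ['good', 'bad'].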
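
To make the difference concrete, here is a small sketch comparing the two. The list names extended and appended are made up for illustration; they are not from the question.

```python
# += on a list iterates over its right-hand operand, so a string
# is added one character at a time.
extended = []
extended += 'good'
extended += 'bad'

# append() adds the whole string as a single item.
appended = []
appended.append('good')
appended.append('bad')

print(extended)  # ['g', 'o', 'o', 'd', 'b', 'a', 'd']
print(appended)  # ['good', 'bad']
```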

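Per the comments on the question, if you want to check the contents of each file when you are only storing its name, you need to adjust this slightly: call open on the name, then test the contents. The test and the normalization stay the same: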

import re

myClassifier = []
for txtEntry in txtList:
    with open(txtEntry) as f:
        # Split into lowercase words so that `word in contents` doesn't give
        # false positives for substrings - avoid e.g. 'ass' in 'class'
        contents = [w.lower() for w in re.split(r'\W+', f.read())]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

Both of these loops assume that, as in your sample data, all of the strings in badWord are lowercase.
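
If that assumption might not hold, one option is to lowercase both sides before comparing. The following is only a sketch: the helper name classify and the sample strings are hypothetical, and it works on a string you pass in rather than opening a file.

```python
import re

def classify(text, bad_words):
    """Return 'bad' if any whole word of text is in bad_words, else 'good'."""
    # Lowercase both sides so the comparison is case-insensitive.
    bad = {word.lower() for word in bad_words}
    words = {w.lower() for w in re.split(r'\W+', text)}
    return 'bad' if bad & words else 'good'

print(classify('I like Pie.', ['PIE', 'Fatigue']))  # bad
print(classify('Magpies are birds', ['pie']))       # good
```

Using sets here also means each membership check is a hash lookup rather than a scan of a list, which may matter for large files.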

Answer 2

To find which files contain bad words, you could:

import re
from pprint import pprint

filenames = ['mywords.txt', 'apple.txt', 'banana.txt', 'something.txt']
bad_words = ['pie', 'vegetable', 'fatigue', 'something']

classified_files = {}  # filename -> good/bad
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search
for filename in filenames:
    with open(filename) as file:
        for line in file:
            if has_bad_words(line):
                classified_files[filename] = 'bad'
                break  # go to the next file
        else:  # no bad words
            classified_files[filename] = 'good'

pprint(classified_files)
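
To see what the compiled pattern does without needing any files on disk, here is a small sketch that applies the same regex to a few in-memory strings. The samples list is made up for illustration.

```python
import re

bad_words = ['pie', 'vegetable', 'fatigue', 'something']
# Same pattern as above: any bad word, with a word boundary (\b) on each
# side, matched case-insensitively.
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search

samples = ['An apple PIE, please', 'Magpies are birds', 'nothing to see']
results = [bool(has_bad_words(line)) for line in samples]
print(results)  # [True, False, False]
```

The word boundaries are what keep 'Magpies' and 'nothing' from matching 'pie' and 'something'.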

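If you want to mark as 'bad' the different inflected forms of a word, e.g., if cactus is in bad_words and you want to exclude cacti (the plural) too, then you might need a word stemmer or, more generally, a lemmatizer, e.g.,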

from nltk.stem.porter import PorterStemmer # $ pip install nltk

stemmer = PorterStemmer()
print(stemmer.stem("pies")) 
# -> pie

or

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cacti'))
# -> cactus

Note: you may need to run import nltk; nltk.download() to download the wordnet data.

It might be simpler just to add all possible forms, such as pies and cacti, directly to the bad_words list.

Answer 3

You should also loop over the badWord items, and for each item check whether it exists in txtEntry.

for txtEntry in txtList:
    if any(word in txtEntry for word in badWord):
        # append() adds the whole string as one item; += would add every
        # letter of "bad" as a separate item (or use myClassifier += ['bad'])
        myClassifier.append("bad")
    else:
        myClassifier.append("good")

Thanks to @lvc for the comment.
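
One caveat worth keeping in mind: word in txtEntry is a substring test on the entry string itself, so it can match inside longer words. A minimal sketch, with made-up names (magpie.txt and pie.txt are not from the question):

```python
bad_words = ['pie']
names = ['magpie.txt', 'pie.txt', 'apple.txt']

# `word in name` is True whenever word occurs anywhere inside name.
flags = [any(word in name for word in bad_words) for name in names]
print(flags)  # [True, True, False]
```

Whether the match on 'magpie.txt' counts as a false positive depends on whether you want whole-word matching.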

Answer 4

Try this code:

    myClassifier.append('bad')

Classifying list entries in Python

4 answers: