I want to determine whether a string is a stop word. I wrote some Python code for this, but I am not getting the correct result. The code is:
stopwords = [ "a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"];
file="C:/Python26/test.txt";
f=open("stopwords.txt",'w');
with open(file,'r') as rf:
    lines = rf.readlines();
    for word in lines:
        if word in stopwords:
            f.write(word.strip("\n")+"\t"'1'"\n");
        else:
            f.write(word.strip("\n")+"\t"'0'"\n");
f.close();
As a result, I get 0 for every token/string stored in test.txt.
Answer 0 (score: 5)
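Basically, you are comparing whole lines against the stop words, because you are iterating over the lines returned by rf.readlines(), not over individual words. Each of those lines also keeps its trailing newline, so even a line that contains a single word, such as "the\n", is not equal to "the". You need to iterate over every word in every line, which requires an extra for loop. Add that loop as shown below: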
for line in lines:
    for word in line.split():  # split() splits the line on whitespace
        if word in stopwords:
            f.write(word.strip("\n")+"\t"'1'"\n")
        else:
            f.write(word.strip("\n")+"\t"'0'"\n")
f.close()
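To make the difference concrete, here is a minimal sketch that can be run without any files. The shortened stop-word set, the helper name label_words, and the lowercasing step are assumptions added for illustration; they are not part of the original code.

```python
# A shortened stop-word collection, used here only for illustration.
# A set is used instead of a list so that membership tests are fast.
stopwords = {"a", "the", "is", "of"}

def label_words(lines):
    """Return "word<TAB>1" or "word<TAB>0" for every whitespace-separated word."""
    labelled = []
    for line in lines:
        for word in line.split():  # split() also drops the trailing "\n"
            # Lowercasing is an added assumption so that "The" matches "the".
            flag = "1" if word.lower() in stopwords else "0"
            labelled.append(word + "\t" + flag)
    return labelled

# Lines as readlines() would return them, trailing newlines included.
lines = ["the\n", "The cat is here\n"]

# Comparing a whole line (newline included) against the stop words fails.
print("the\n" in stopwords)
print(label_words(lines))
```

Comparing the raw line never matches, while the word-by-word loop labels each token separately.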
Answer 1 (score: 0)
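The problem is how each line is split. A good option is to use a list comprehension: split the string into a list of words and iterate over that list.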
stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]
def stop_word_test(test_word):
    # Return the word followed by a tab and 1 (stop word) or 0 (not a stop word).
    if test_word in stopwords:
        return test_word.strip("\n")+"\t"'1'"\n"
    else:
        return test_word.strip("\n")+"\t"'0'"\n"

with open("c:\\stopwords.txt", 'w') as write_file:
    with open("C:\\test.txt", 'r') as r_file:
        # Replace every non-letter character with a space, then split into words.
        [write_file.write(value)
         for value in [stop_word_test(word)
                       for line in r_file.readlines()
                       for word in "".join((char if char.isalpha() else " ") for char in line).split()]]
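In the example above, each line is cleaned up by replacing every character that is not a letter (punctuation, digits, and so on) with a space before it is split into words.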
Also, the trailing ; is not necessary in Python.
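One side effect of replacing every non-letter with a space is that contractions such as "don't" are split into "don" and "t", so the stop words that contain an apostrophe can never match. A possible alternative, sketched below under stated assumptions, is to tokenize with a regular expression that keeps an apostrophe inside a word. The pattern, the shortened stop-word set, and the names WORD_RE, tokenize and label_line are hypothetical choices for illustration, not part of the answer above.

```python
import re

# A shortened stop-word set, used here only for illustration.
stopwords = {"don't", "the", "is", "it's"}

# One or more letters, optionally followed by an apostrophe and more letters,
# so "don't" stays one token while surrounding punctuation is dropped.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(line):
    """Lowercase the line and return its word tokens."""
    return WORD_RE.findall(line.lower())

def label_line(line):
    """Pair every token with 1 if it is a stop word, otherwise 0."""
    return [(word, 1 if word in stopwords else 0) for word in tokenize(line)]

print(label_line("Don't panic, it's fine.\n"))
```

Whether lowercasing and keeping apostrophes is the right behaviour depends on how the tokens in test.txt are written, so treat this as a starting point rather than a drop-in replacement.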