Identifying whether a string is a stop word in Python

Time: 2015-01-05 14:39:52

Tags: python

I want to determine whether a string is a stop word. I wrote some Python code for this, but I am not getting the correct result. The code is:

stopwords = [ "a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"];
file="C:/Python26/test.txt";
f=open("stopwords.txt",'w');
with open(file,'r') as rf:
    lines = rf.readlines();
    for word in lines:
        if word in stopwords:
            f.write(word.strip("\n")+"\t"'1'"\n");            
        else:
            f.write(word.strip("\n")+"\t"'0'"\n");
    f.close();

As a result, I get 0 for every token/string stored in the test.txt file.

2 answers:

Answer 0: (score: 5)

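Basically, you are comparing whole lines against the stop words, because you are iterating over the sentences/lines returned by rf.readlines(), not over individual words. Each of those lines also keeps its trailing newline, so even a line containing a single word never matches. You need to iterate over each word in each line, which requires an additional for loop. Add it as shown below to iterate over every word in every line: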

for line in lines:
    for word in line.split():  # split() splits the line on whitespace and drops the trailing newline
        if word in stopwords:
            f.write(word + "\t" + "1" + "\n")
        else:
            f.write(word + "\t" + "0" + "\n")
f.close()  # close the file once, after every line has been processed
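
To make the cause of the all-zero output concrete, here is a minimal, self-contained sketch. It assumes a shortened stop-word list and an in-memory stand-in for test.txt, and the helper name tag_words is hypothetical; none of these come from the question. It first shows that a line returned by readlines() still ends with "\n", then applies the nested loop described above.

```python
import io

# Shortened list for illustration only; the question uses a much longer one.
stopwords = ["a", "the", "is", "of"]

# In-memory stand-in for test.txt so the sketch runs without touching the disk.
sample = io.StringIO("the\ncat is a friend of mine\n")
lines = sample.readlines()

# Each line keeps its trailing newline, so a direct lookup fails.
print(lines[0] == "the\n")    # True
print(lines[0] in stopwords)  # False

def tag_words(lines, stopwords):
    """Return one "word<TAB>flag" row per word; flag is 1 for stop words."""
    rows = []
    for line in lines:
        for word in line.split():  # split() also discards the newline
            flag = "1" if word in stopwords else "0"
            rows.append(word + "\t" + flag)
    return rows

rows = tag_words(lines, stopwords)
print("\n".join(rows))
```

Collecting the rows in a list before writing them is only a convenience for the sketch; writing each row directly to the output file, as in the answer, works just as well.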

Answer 1: (score: 0)

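The problem lies in how you split the line. A good option is to use a list comprehension: split the string into a list and iterate over that list.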

stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]

def stop_word_test(test_word):
    if test_word in stopwords:
        return test_word.strip("\n")+"\t"'1'"\n"
    else:
        return test_word.strip("\n")+"\t"'0'"\n"

with open("c:\\stopwords.txt", 'w') as write_file:
    with open("C:\\test.txt", 'r') as r_file:
        for line in r_file:
            # replace every non-letter character with a space, then split into words
            cleaned = "".join(char if char.isalpha() else " " for char in line)
            write_file.writelines([stop_word_test(word) for word in cleaned.split()])

In the example above, we sanitize each line by replacing every character that is not a letter with a space.
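
One side effect of that sanitizing step: an apostrophe is not a letter, so "don't" is split into "don" and "t", and the contractions in the list can never match. A possible variation is sketched below under stated assumptions: the regular expression, the lower-casing, the shortened stop-word set, and the helper names tokenize and is_stop_word are all illustrative choices, not part of the answer. It keeps apostrophes inside words and stores the stop words in a set for faster membership tests.

```python
import re

# Shortened set for illustration only; a set gives fast membership tests.
STOPWORDS = {"a", "the", "is", "don't", "it's"}

# Letters, optionally followed by apostrophe-plus-letters groups ("don't", "it's").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(line):
    """Lower-case the line and return its words, keeping inner apostrophes."""
    return WORD_RE.findall(line.lower())

def is_stop_word(word):
    """Return True if the (already lower-cased) word is in the stop-word set."""
    return word in STOPWORDS

tokens = tokenize("Don't panic, it's the end.")
flags = [(word, int(is_stop_word(word))) for word in tokens]
print(flags)
```

Lower-casing is a deliberate choice here because every entry in the stop-word list is lower case; without it, "The" at the start of a sentence would be written out with a 0.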

Also, the trailing ; is not needed in Python.