Question

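I am trying to write code that indexes a corpus, but I keep getting an error and I'm not sure where I went wrong. Here is the code snippet. Thanks for any help or guidance.

Error:

    reduced = filter(lambda w: w not in stopwords, re.split(r'\W+', words.lower()))
    AttributeError: 'file' object has no attribute 'lower'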

#within your program ignore any term that matches a stop word.
#Creating .txt file with stopwords and thereby creating a routine in which stopwords are ignored and counting the number of words


    stopwords = open(r'C:\User\Desktop\cacm\stopwords.txt',"r")
    words = open (r'C:\Users\Desktop\cacm\cacm\index.dat',"r")
    reduced = filter(lambda w: w not in stopwords, re.split(r'\W+', words.lower()))
    counts= Counter (reduced)
    print list ( reduced)

    #Ignore any term that begins with a punctuation character
    # Ignore any term that is a number 
    cleaned_text = re.sub(r'[^a-zA-Z0-9]', '', "C:\Users\Desktop\cacm\cacm\index.dat")
    # Ignore any term that is 2 characters or shorter in length  
    shortword = re.compile(r'\W*\b\w{1,2}\b')

Answer 1

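The open function returns a file object, not a list of strings. If you want a list of the lines in a text file, you can call the readlines() method on the file object. However, iterating over the file and processing it one line at a time may be more efficient. It's up to you.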
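
As a minimal sketch of those two approaches, written for Python 3: the temporary file and its contents are assumptions made only so the example is self-contained, and they stand in for the real index.dat.

```python
import os
import tempfile

# Hypothetical sample data standing in for the real index.dat
fd, path = tempfile.mkstemp(suffix=".dat")
with os.fdopen(fd, "w") as f:
    f.write("The Quick\nbrown FOX\n")

# Approach 1: readlines() loads every line into a list at once
with open(path) as f:
    all_lines = f.readlines()

# Approach 2: iterate over the file object, one line at a time
lowered = []
with open(path) as f:
    for line in f:
        # strip() removes the trailing newline before lowercasing
        lowered.append(line.strip().lower())

os.remove(path)
print(all_lines)
print(lowered)
```

The `with` blocks close each file automatically, even if an exception is raised while reading.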

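On the other hand, even if words were a list of strings, it still would not have a lower() method. You need to iterate over the list and call lower() on each element.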
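
A short illustration of that point, using a made-up list of words (the sample values are an assumption, not data from the question):

```python
words = ["The", "Quick", "FOX"]  # hypothetical sample list

# A list has no lower() method, so words.lower() raises AttributeError
try:
    words.lower()
except AttributeError as exc:
    error_message = str(exc)

# Call lower() on each string in the list instead
lowered = [w.lower() for w in words]
print(error_message)
print(lowered)
```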

Remember to close the file object once you are done with it.

You could do something like this:

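    f = open("stopwords.txt")
    # strip() removes the trailing newline that readlines() keeps on each line
    stopwords = [line.strip().lower() for line in f.readlines()]
    f.close()

    f = open("index.dat")
    # assuming one word per line; blank lines are skipped
    words = [w for w in (line.strip().lower() for line in f.readlines()) if w and w not in stopwords]
    f.close()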

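You can include other processing, such as removing punctuation, in a similar way, either in the same expression or in separate ones. It depends on how compact you want the code to be versus how many passes over the data you are willing to make.

I hope that helps.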
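
One possible way to combine those steps in a single pass is sketched below for Python 3. The function name `count_terms` and the sample text are hypothetical. The sketch also simplifies the question's rules: it splits on punctuation rather than discarding whole terms that begin with a punctuation character.

```python
import re
from collections import Counter

def count_terms(text, stopwords):
    """Count terms in text, skipping stop words, numbers and short terms."""
    # Splitting on runs of non-word characters also discards punctuation
    tokens = re.split(r"\W+", text.lower())
    kept = [
        t for t in tokens
        if len(t) > 2            # drop empty strings and terms of 1-2 characters
        and not t.isdigit()      # drop terms that are pure numbers
        and t not in stopwords   # drop stop words
    ]
    return Counter(kept)

# Hypothetical sample data; in practice the text and the stop words
# would be read from index.dat and stopwords.txt
stopwords = {"the", "and", "over"}
text = "The quick brown fox, and the lazy dog: 1964 foxes jumped over it!"
counts = count_terms(text, stopwords)
print(counts.most_common())
```

Keeping the stop words in a set makes each membership test fast, which matters when the corpus is large.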

How can I use Python to ignore stop words in a txt file, count the words, and remove punctuation?
