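I am trying to implement code that indexes a corpus, but I keep getting an error and I am not sure where I went wrong. The code snippet is below. Thanks for any help or guidance.
Error:
reduced = filter(lambda w: w not in stopwords, re.split(r'\W+', words.lower()))
AttributeError: 'file' object has no attribute 'lower'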
#within your program ignore any term that matches a stop word.
#Creating .txt file with stopwords and thereby creating a routine in which stopwords are ignored and counting the number of words
stopwords = open(r'C:\User\Desktop\cacm\stopwords.txt',"r")
words = open (r'C:\Users\Desktop\cacm\cacm\index.dat',"r")
reduced = filter(lambda w: w not in stopwords, re.split(r'\W+', words.lower()))
counts= Counter (reduced)
print list ( reduced)
#Ignore any term that begins with a punctuation character
# Ignore any term that is a number
cleaned_text = re.sub(r'[^a-zA-Z0-9]', '', "C:\Users\Desktop\cacm\cacm\index.dat")
# Ignore any term that is 2 characters or shorter in length
shortword = re.compile(r'\W*\b\w{1,2}\b')
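Answer 0 (score: 0)
The open function returns a file object, not a list of strings. If you want a list of the lines in a text file, you can call the readlines() method on the file object. However, iterating over the file and processing one line at a time may be more efficient. It is up to you.
On the other hand, even if words were a list of strings, it would not have a lower() method. You need to iterate over it and call lower() on each element of the list.
Remember to close the file object once you are done with it.
You can do it like this: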
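A minimal sketch of the two points above, written for Python 3. The temporary file and its contents are made up for illustration; the point is only that neither the file object nor the list of lines has lower(), while each individual line does.

```python
import os
import tempfile

# Hypothetical sample data, written to a temporary file for illustration only.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as handle:
    handle.write("Alpha\nBETA\ngamma\n")

handle = open(path)
file_has_lower = hasattr(handle, "lower")   # the file object itself has no lower()

lines = handle.readlines()                  # a list of strings, newlines included
handle.close()
list_has_lower = hasattr(lines, "lower")    # the list has no lower() either

# lower() exists on each string, so call it element by element.
lowered = [line.strip().lower() for line in lines]
print(file_has_lower, list_has_lower, lowered)
```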
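# Read the stop words, stripping the trailing newline that readlines() keeps
f = open("stopwords.txt")
stopwords = set(line.strip().lower() for line in f.readlines())
f.close()
# Keep every lowercased word that is not a stop word (assuming one word per line)
f = open("index.dat")
words = [word.strip().lower() for word in f.readlines() if word.strip().lower() not in stopwords]
f.close()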
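The explicit close() calls can also be left to a with block, which closes the file when the block ends, even if an error is raised inside it. This is a sketch that assumes one word per line; load_words is a name made up here, and the temporary file only stands in for stopwords.txt so the example runs on its own.

```python
import os
import tempfile

def load_words(path):
    """Return the lowercased, stripped, non-empty lines of a file (hypothetical helper)."""
    with open(path) as handle:      # closed automatically when the block exits
        return [line.strip().lower() for line in handle if line.strip()]

# Stand-in for stopwords.txt, created only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "stopwords.txt")
with open(demo_path, "w") as handle:
    handle.write("The\nand\n\nOf\n")

stopwords = set(load_words(demo_path))
print(sorted(stopwords))
```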
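You can add other processing, such as removing punctuation, in a similar way, either in the same expression or in separate ones. It depends on how compact you want the code to be versus how many passes over the data you are willing to make.
I hope that helps.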
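As one possible way to fold the remaining rules from the question into a single pass (skip stop words, terms that begin with punctuation, numbers, and terms of two characters or fewer), here is a hedged sketch. The function name index_terms and the sample text are assumptions for illustration. It splits on whitespace rather than on \W+ so that a leading punctuation character is still visible when the term is tested, and it treats only all-digit terms as numbers.

```python
import string
from collections import Counter

def index_terms(text, stopwords):
    """Count the terms in text that survive the filtering rules (illustrative helper)."""
    counts = Counter()
    for term in text.lower().split():
        if term[0] in string.punctuation:       # begins with a punctuation character
            continue
        term = term.strip(string.punctuation)   # drop trailing punctuation such as "," or "."
        if len(term) <= 2:                      # 2 characters or shorter
            continue
        if term.isdigit():                      # a number (digits only, in this sketch)
            continue
        if term in stopwords:                   # a stop word
            continue
        counts[term] += 1
    return counts

sample = "The index of 1958 papers, and the index terms: #ignored an IR system."
print(index_terms(sample, {"the", "and", "of"}))
```

Doing everything in one loop means a single pass over the data; the same rules could instead be written as separate filtering steps if readability matters more than the number of passes.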