How can I access stopwords from a file in my working directory? Note that I can't figure out how to access the NLTK stopwords because 1. the download URL is blocked for me, and 2. I can't find the NLTK_DATA directory; manually downloading the files and placing them in the corpora directory doesn't help either.
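For reference, the usual way to point NLTK at a manually downloaded nltk_data folder is to append its location to nltk.data.path; a minimal sketch, where the folder path is only a placeholder for wherever corpora/stopwords was unpacked:

import nltk
# Placeholder path: wherever the manually downloaded nltk_data folder
# (containing corpora/stopwords) was unpacked.
nltk.data.path.append(r'C:\Users\Istcrmt\nltk_data')
from nltk.corpus import stopwords as nltk_stopwords
stopwords = set(nltk_stopwords.words('english'))

Since that download is blocked here, the code below falls back to a plain stopwords file in the working directory instead.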
import collections
file = open(r'C:\Users\Istcrmt\Documents\Python\EdX Python for Data Science\word_cloud\98-0.txt', encoding='utf-8')
# Load the stopwords from a plain-text file (one word per line) in the working directory.
stopwords = set(line.strip() for line in open('stopwords'))
# Create your data structure here.
wordcount = {}
# Instantiate a dictionary, and for every word in the file, add it to
# the dictionary if it doesn't exist. If it does, increase the count.
# Hint: to eliminate duplicates, remember to strip punctuation and
# normalize case. The functions lower() and split() will be useful!
for word in file.read().lower().split():
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace("\"", "")
    word = word.replace("“", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# after building your wordcount, you can then sort it and return the first
# n words. If you want, collections.Counter may be useful.
d = collections.Counter(wordcount)
#print(d.most_common(10))
for word, count in d.most_common(10):
    print(word, ": ", count)
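The replace() calls above only cover a few punctuation marks. A more thorough alternative is to strip any leading or trailing punctuation with string.punctuation; the normalize() helper below is just a sketch, not part of the course template:

import string

def normalize(word):
    # Remove leading/trailing ASCII punctuation plus the curly quotes
    # that appear in the Gutenberg text.
    return word.strip(string.punctuation + '“”‘’')

Inside the loop, word = normalize(word) would then take the place of the chain of replace() calls.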