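I'm just getting started with Python and NLTK, and I'm trying to read records from a CSV file and determine how often specific words occur across all of the records. I can do something like this: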
import csv
import nltk

# f is an already-open file object for the CSV file
with f:
    reader = csv.reader(f)
    # Skip the header
    next(reader)
    for row in reader:
        # The text of interest is in the fifth column
        note = row[4]
        tokens = note.split()
        # Calculate raw frequency distribution for this row
        freq = nltk.FreqDist(tokens)
        for key, val in freq.items():
            print(str(key) + ':' + str(val))
        # Plot the results
        freq.plot(20, cumulative=False)
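I'm not sure how to modify this so that the frequencies are computed over all records and include only the words I'm interested in. Apologies if this is a very simple question.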
Answer 0 (score: 0)
You can define a counter outside the loop, freq_all = nltk.FreqDist(), and then update it on each row with freq_all.update(tokens):
with f:
    reader = csv.reader(f)
    # Skip the header
    next(reader)
    # Running total across all rows
    freq_all = nltk.FreqDist()
    for row in reader:
        note = row[4]
        tokens = note.split()
        # Calculate raw frequency distribution for this row
        freq = nltk.FreqDist(tokens)
        freq_all.update(tokens)
        for key, val in freq.items():
            print(str(key) + ':' + str(val))
        # Plot the results
        freq.plot(20, cumulative=False)
    # Plot the overall results
    freq_all.plot(20, cumulative=False)
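
The question also asks for counts of only the words of interest, which the code above does not cover. One possible approach is to filter the tokens before updating the running total. The following is a minimal sketch under several assumptions: the sample CSV contents, the WORDS_OF_INTEREST set, and the count_words helper are all hypothetical, and collections.Counter stands in for nltk.FreqDist so that the example runs without NLTK installed.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data standing in for the real CSV file.
# As in the question, the text of interest is in column index 4.
SAMPLE_CSV = """id,a,b,c,note
1,x,x,x,the pump failed and the pump leaked
2,x,x,x,valve stuck open
3,x,x,x,Pump replaced; valve ok
"""

# Hypothetical set of words to count (lowercase).
WORDS_OF_INTEREST = {"pump", "valve"}


def count_words(csv_file, words, column=4):
    """Count occurrences of `words` in `column`, summed over all rows."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header
    totals = Counter()
    for row in reader:
        # Lowercase so that "Pump" and "pump" are counted together.
        tokens = row[column].lower().split()
        # Keep only the tokens we care about.
        totals.update(t for t in tokens if t in words)
    return totals


totals = count_words(io.StringIO(SAMPLE_CSV), WORDS_OF_INTEREST)
for word, count in totals.most_common():
    print(f"{word}: {count}")
```

Note that split() does not strip punctuation, so a token such as "pump," would not match "pump". In recent NLTK versions FreqDist is a subclass of Counter, so replacing Counter() with nltk.FreqDist() here should behave the same way and additionally make totals.plot() available.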