How to use NLTK to find the frequency distribution of specific words in a CSV file

Time: 2019-02-03 01:40:52

Tags: python nltk

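I'm just getting started with Python and NLTK, and I'm trying to read records from a CSV file and determine how often specific words occur across all records. So far I can do something like this: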

import csv
import nltk

# f is an already-open file object for the CSV file
with f:
    reader = csv.reader(f)

    # Skip the header
    next(reader)

    for row in reader:
        note = row[4]
        tokens = note.split()

        # Calculate raw frequency distribution for this row only
        freq = nltk.FreqDist(tokens)
        for key, val in freq.items():
            print(str(key) + ':' + str(val))

        # Plot the results
        freq.plot(20, cumulative=False)

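I'm not sure how to modify this so that the frequency is computed across all records and only includes the words I'm interested in. Apologies if this is a very simple question.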

1 answer:

Answer 0: (score: 0)

You can define a counter outside the loop, freq_all = nltk.FreqDist(), and then update it for each row with freq_all.update(tokens):

import csv
import nltk

# f is an already-open file object for the CSV file
with f:
    reader = csv.reader(f)

    # Skip the header
    next(reader)

    # Accumulates counts across all rows
    freq_all = nltk.FreqDist()

    for row in reader:
        note = row[4]
        tokens = note.split()

        # Calculate raw frequency distribution for this row only
        freq = nltk.FreqDist(tokens)
        # Add this row's tokens to the overall counts
        freq_all.update(tokens)
        for key, val in freq.items():
            print(str(key) + ':' + str(val))

        # Plot the results
        freq.plot(20, cumulative=False)

    # Plot the overall results
    freq_all.plot(20, cumulative=False)
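
The question also asks for counts of only the words of interest, which the code above does not cover. One possible way to do that is to filter the tokens before updating the shared counter. The following is a minimal sketch, not a definitive solution: the helper count_words, the WORDS_OF_INTEREST set, and the sample data are all made up for illustration, and it assumes the notes sit in column index 4 as in the question.

```python
import csv
import io

import nltk

# Hypothetical set of words to track; replace with your own.
WORDS_OF_INTEREST = {"error", "timeout"}


def count_words(csv_file, column=4, words=WORDS_OF_INTEREST):
    """Return one FreqDist over all rows, restricted to `words`."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if there is one

    freq_all = nltk.FreqDist()
    for row in reader:
        if len(row) <= column:
            continue  # skip short or malformed rows
        # Lowercase so "Error" and "error" count as the same word.
        tokens = row[column].lower().split()
        # Only tokens in the set of interest reach the counter.
        freq_all.update(t for t in tokens if t in words)
    return freq_all


# Made-up sample data standing in for the real file.
sample = io.StringIO(
    "id,a,b,c,note\n"
    "1,x,x,x,Timeout error on login\n"
    "2,x,x,x,error while saving\n"
    "3,x,x,x,all good\n"
)
freq = count_words(sample)
for word, count in freq.most_common():
    print(f"{word}: {count}")
```

With a real file you would pass an open file object instead of the StringIO sample, and you can still call freq.plot(20, cumulative=False) on the result. Note that str.split() leaves punctuation attached to words (for example "error," would not match "error"), so a proper tokenizer may be a better fit for messy text.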