Convert a list with duplicate values into a dictionary of frequency counts in Python

Time: 2015-11-19 17:25:21

Tags: python list csv dictionary

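I have a list of strings that contains duplicate values. I want to build a dictionary of words in which each key is a word and its value is the frequency count, and then write these words and their counts to a CSV file: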

Here is my approach:

#!/usr/bin/env python
# encoding: utf-8

# -*- coding: utf8 -*-
import csv
from nltk.tokenize import TweetTokenizer
import numpy as np

tknzr = TweetTokenizer()

#print tknzr.tokenize(s0)

with open("dispn.csv","r") as file1,\
     open("dispn_tokenized.csv","w") as file2,\
     open("dispn_tokenized_count.csv","w") as file3:

     mycsv = list(csv.reader(file1))

     words = []
     words_set = []
     tokenize_count = {}
     for row in mycsv:

         lst = tknzr.tokenize(row[2])
         for l in lst:
             file2.write("\""+str(row[2])+"\""+","+"\""+str(l.encode('utf-8'))+"\""+"\n")
             l = l.lower()
             words.append(l)
     words_set = list(set(words))
     print "len of words_set : " + str(len(words_set))
     for word in words_set:
        tokenize_count[word] = 1

     for word in words:
        tokenize_count[word] = tokenize_count[word]+1




     print "len of tokenized words_set : " + str(len(tokenize_count))

     #print "Tokenized_words count : "
     #print tokenize_count
     #print "================================================================="

     i = 0
     for wrd in words_set:
       #i = i+1
       print "i : " +str(i)
       file3.write("\""+str(i)+"\""+","+"\""+str(wrd.encode('utf-8'))+"\""+","+"\""+str(tokenize_count[wrd])+"\""+"\n")

But in the CSV I still find some duplicate values, such as 1, 5, 4, 7, 9.

Some information about the approach:

    - dispn.csv contains the usernames of the users,
      which I am tokenizing with the help of the nltk module.
    - After tokenizing them, I store the words in the list 'words'
      and write each word alongside its username to a CSV file.
    - I create a set from it to get the unique values out of 'words'
      and store it in 'words_set'.
    - I then create the dictionary 'tokenize_count', with the word as key and
      its frequency count as value, and write it to a CSV file.

Why are only some of the values repeated? Is there a better way to do the same thing? Please help.

1 answer:

Answer 0: (score: 1)

`from collections import Counter`

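Counter can be called on a list of strings and returns a dictionary-like object whose keys are the words and whose values are their frequencies.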
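
As a rough illustration of that suggestion, here is a minimal sketch written for Python 3 (the question's code is Python 2). The sample `words` list is made up and stands in for the tokens collected in the question, and the in-memory buffer replaces the output file so the snippet runs on its own. Note that the question's loop starts every count at 1 and then adds 1 per occurrence, so its counts come out one too high; Counter avoids that. If the repeated values in the CSV are the counts themselves, that may simply be because different words can share the same frequency.

```python
import csv
import io
from collections import Counter

# Hypothetical sample tokens standing in for the question's `words` list.
words = ["john", "doe", "john", "smith", "doe", "john"]

# Counter tallies each distinct item directly; no set() and no manual
# initialization of the dictionary are needed.
tokenize_count = Counter(words)

# Write one row per distinct word: running index, word, count.
# An in-memory buffer is used so the sketch does not touch the disk;
# a file opened with open(path, "w", newline="") could be used instead.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
for i, (word, count) in enumerate(tokenize_count.most_common()):
    writer.writerow([i, word, count])

csv_text = buffer.getvalue()
print(csv_text)
```

Using `csv.writer` instead of concatenating quote characters by hand also takes care of escaping words that themselves contain quotes or commas.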