I have a list of strings containing duplicate values. I want to build a dictionary of words, where the key is the word and the value is its frequency count, and then write those words and their counts to a csv file.
Here is how I am doing it:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

with open("dispn.csv", "r") as file1, \
     open("dispn_tokenized.csv", "w") as file2, \
     open("dispn_tokenized_count.csv", "w") as file3:
    mycsv = list(csv.reader(file1))
    words = []
    words_set = []
    tokenize_count = {}
    for row in mycsv:
        lst = tknzr.tokenize(row[2])
        for l in lst:
            file2.write("\"" + str(row[2]) + "\"" + "," + "\"" + str(l.encode('utf-8')) + "\"" + "\n")
            l = l.lower()
            words.append(l)
    words_set = list(set(words))
    print "len of words_set : " + str(len(words_set))
    for word in words_set:
        tokenize_count[word] = 1
    for word in words:
        tokenize_count[word] = tokenize_count[word] + 1
    print "len of tokenized words_set : " + str(len(tokenize_count))
    #print "Tokenized_words count : "
    #print tokenize_count
    #print "================================================================="
    i = 0
    for wrd in words_set:
        #i = i+1
        print "i : " + str(i)
        file3.write("\"" + str(i) + "\"" + "," + "\"" + str(wrd.encode('utf-8')) + "\"" + "," + "\"" + str(tokenize_count[wrd]) + "\"" + "\n")
But in the csv I still find some duplicated values, such as 1, 5, 4, 7, 9.
Some information about the approach:
- dispn.csv contains the usernames of the users, which I am tokenizing with the help of the nltk module.
- After tokenizing them, I store these words in the list 'words' and write each word next to the username it came from to a csv.
- I then create a set of that list to get the unique values out of 'words' and store them in 'words_set'.
- Finally I build the dictionary 'tokenize_count' with the word as key and its frequency count as value, and write it to a csv (a condensed sketch of this counting-and-writing step is shown right after this list).
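A condensed sketch of that counting-and-writing step (not the script above, just an illustration that assumes, like the script, that the username text is in row[2] of dispn.csv) would look roughly like this using collections.Counter and csv.writer:

import csv
from collections import Counter
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
counts = Counter()

with open("dispn.csv", "r") as fin:
    for row in csv.reader(fin):
        # lowercase before counting so "Foo" and "foo" end up as one key
        counts.update(w.lower() for w in tknzr.tokenize(row[2]))

with open("dispn_tokenized_count.csv", "w") as fout:
    writer = csv.writer(fout)
    for i, (word, freq) in enumerate(counts.most_common()):
        # csv.writer takes care of the quoting, no manual "\"" concatenation needed
        writer.writerow([i, word.encode('utf-8'), freq])

Here counts.most_common() yields every unique word exactly once with its frequency, so the first column is just a running index.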
Why are only some of the values duplicated? Is there a better way to do the same thing? Please help.