The CSV looks like this. The ' | ' separates the columns.
2014-09-01 | I love chicken
2014-09-01 | I eat chicken
2014-09-02 | She loves chicken
2014-09-02 | Ha ha ha I love chicken
2014-09-03 | Blah Blah Blah
I want to process the data so that it looks like this.
2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |
2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |
2014-09-03 | 'blah', 3 |
DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...
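What approach should I use here? Ultimately I want to draw a graph with the dates on the x-axis and the word counts (frequencies) on the y-axis.
Below is my best attempt so far.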
import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"
with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        if str(row[0:1])[2:12] == TestStartDate:
            # str(row[1:2])[2:str(row[1:2]).find('"')-1] is the second column
            tagger = MeCab.Tagger()
            rose = tagger.parse(str(row[1:2])[2:str(row[1:2]).find('"')-1])
            # print(rose)
            wordCount = {}
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))
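I plan to add the words and counts to the data.
Answer 0 (score: 0)
This works for me. Do you really need the trailing '|'? If you split each line on '|' again later, for example before passing it to matplotlib or something similar, the result will contain an extra empty string ''.
The code below does not append a '|' to each result line. If you think it is necessary, just append a '|' in function d, like this: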
return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))
===========
def d(s):
    # s looks like '<date> | <all text for that date>'
    tokens = s.split('|')
    words = tokens[-1].strip().lower().split(' ')
    return '%s| %s' % (tokens[0], '|'.join(["'%s',%s" % (word, words.count(word)) for word in set(words)]))

def wordcount():
    lines = [
        '2014-09-01 | I love chicken',
        '2014-09-01 | I eat chicken',
        '2014-09-02 | She loves chicken',
        '2014-09-02 | Ha ha ha I love chicken',
        '2014-09-03 | Blah Blah Blah'
    ]
    # concatenate the text of all rows that share a date
    rows = {}
    for line in lines:
        t_line = line.split(' | ')
        if t_line[0] not in rows:
            rows[t_line[0]] = ''
        rows[t_line[0]] += (' ' + t_line[-1])
    newrows = []
    for k, v in rows.items():
        newrows.append(d('%s | %s' % (k, v)))
    print('\n'.join(newrows))

wordcount()
>>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
>>2014-09-03 | 'blah',3
>>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1
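To illustrate the point about the trailing '|' made at the start of this answer, here is a small sketch (the variable names are only for illustration). It shows that a line ending in the delimiter produces an extra empty field when split, and one possible way to filter it out if the trailing '|' is kept anyway.

```python
# A row that ends with the delimiter yields an extra empty field when split.
with_trailing = "2014-09-03 | 'blah',3|"
without_trailing = "2014-09-03 | 'blah',3"

print(with_trailing.split('|'))
print(without_trailing.split('|'))

# If the trailing '|' is kept, empty fields can be dropped after splitting.
fields = [field.strip() for field in with_trailing.split('|') if field.strip()]
print(fields)
```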
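Answer 1 (score: 0)
Read the input CSV and build a dictionary that maps each date to a Counter. For each row, update the counter for that row's date with the words in the row. Then write out rows of the form [date, (word1, count1), (word2, count2), ...]. This example sorts the dates and the words, but you can omit the sorting for better performance.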
from collections import Counter
import csv

data = {}
with open('my_data.csv') as f:
    for date, words in csv.reader(f, delimiter='|'):
        # strip the spaces around the '|' delimiter; lowercase to match the desired output
        data.setdefault(date.strip(), Counter()).update(words.lower().split())

with open('my_counts.csv', 'w') as f:
    writer = csv.writer(f, delimiter='|')
    for date in sorted(data):
        # one "'word', count" cell per word seen on this date
        writer.writerow([date] + ["'{0}', {1}".format(word, count) for word, count in sorted(data[date].items())])
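For trying out this approach without any files on disk, the following is a minimal sketch written for Python 3. It feeds the sample rows from the question through `io.StringIO`; the helper names `count_words_by_date` and `format_rows` are made up for this example, and lowercasing the words is an assumption based on the desired output in the question.

```python
import csv
import io
from collections import Counter

# Sample rows copied from the question; io.StringIO stands in for a real file.
SAMPLE = """2014-09-01 | I love chicken
2014-09-01 | I eat chicken
2014-09-02 | She loves chicken
2014-09-02 | Ha ha ha I love chicken
2014-09-03 | Blah Blah Blah
"""

def count_words_by_date(lines):
    """Map each date to a Counter of the lowercased words seen on that date."""
    counts = {}
    for date, text in csv.reader(lines, delimiter='|'):
        counts.setdefault(date.strip(), Counter()).update(text.lower().split())
    return counts

def format_rows(counts):
    """Build one [date, "'word', n", ...] row per date, sorted by date and word."""
    return [
        [date] + ["'{0}', {1}".format(word, n) for word, n in sorted(counts[date].items())]
        for date in sorted(counts)
    ]

counts = count_words_by_date(io.StringIO(SAMPLE))
out = io.StringIO()
csv.writer(out, delimiter='|', lineterminator='\n').writerows(format_rows(counts))
print(out.getvalue())
```

Note that "love" and "loves" are counted as separate words here; merging them would need a stemming step that this sketch does not attempt.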
Answer 2 (score: 0)
I suggest using Counter for the counting.
import re
from collections import Counter

stats = {}
with open('in.txt', 'r') as fin:
    for line in fin:
        # split on '|' and whitespace, dropping the empty strings and the trailing newline
        tokens = [t for t in re.split(r'[|\s]+', line.strip()) if t]
        if not tokens:
            continue  # skip blank lines
        key = tokens.pop(0)
        counter = Counter(tokens)
        if key in stats:
            stats[key] = stats[key] + counter
        else:
            stats[key] = counter

for key, counter in stats.items():
    print('%s | %s |' % (key, '|'.join(['"%s", %s' % (k, v) for k, v in counter.items()])))
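The `stats[key] + counter` step relies on the fact that adding two Counter objects sums their counts word by word. A short sketch of that behaviour, using two sentences from the question (lowercased here for illustration), together with the in-place `update()` alternative:

```python
from collections import Counter

first = Counter("i love chicken".split())
second = Counter("i eat chicken".split())

# '+' returns a new Counter whose counts are the per-word sums.
merged = first + second
print(merged["i"], merged["chicken"], merged["love"], merged["eat"])

# update() adds counts in place, so no "if key in stats" branch is needed.
running = Counter()
for sentence in ("i love chicken", "i eat chicken"):
    running.update(sentence.split())
print(running == merged)
```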
Answer 3 (score: 0)
Here is a solution that uses defaultdict and Counter from the collections module.
import csv
from collections import defaultdict
from collections import Counter

# every new date starts with an empty Counter
date_words = defaultdict(Counter)
with open('test.csv') as psvfile:
    reader = csv.reader(psvfile, delimiter="|")
    for line in reader:
        date = line[0].strip()  # drop the space before the '|'
        words = line[1].split()
        date_words[date].update(words)
You may also want to consider the pandas library, which is good at handling dates and at plotting.
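Whichever plotting tool is used, the per-date counters first have to become two parallel sequences, dates for the x-axis and counts for the y-axis. The following is a minimal standard-library sketch of that step; the `date_words` literal and the `series_for_word` helper are hypothetical and only mirror the shape of the data built above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical per-date counters, shaped like the ones built in the answers above.
date_words = {
    "2014-09-01": Counter({"i": 2, "chicken": 2, "love": 1, "eat": 1}),
    "2014-09-02": Counter({"ha": 3, "chicken": 2, "she": 1, "loves": 1, "i": 1, "love": 1}),
    "2014-09-03": Counter({"blah": 3}),
}

def series_for_word(date_words, word):
    """Return (dates, counts) for one word, ordered by date."""
    days = sorted(date_words)
    x = [datetime.strptime(day, "%Y-%m-%d").date() for day in days]
    y = [date_words[day][word] for day in days]  # a Counter gives 0 for missing words
    return x, y

x, y = series_for_word(date_words, "chicken")
print(list(zip(x, y)))
# x and y can then be passed to a plotting call such as matplotlib.pyplot.plot(x, y).
```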