How do I read and group this CSV data?

Asked: 2014-09-26 01:33:50

Tags: python csv python-ggplot

The CSV looks like this. The ' | ' separates the columns.

2014-09-01 | I love chicken

2014-09-01 | I eat chicken

2014-09-02 | She loves chicken

2014-09-02 | Ha ha ha I love chicken

2014-09-03 | Blah Blah Blah

I want to process the data so that it looks like this.

2014-09-01 | 'i', 2 | 'love', 1 | 'chicken', 2 | 'eat', 1 |

2014-09-02 | 'she', 1 | 'love', 2 | 'chicken', 2 | 'ha', 3 | 'I', 1 |

2014-09-03 | 'blah', 3 |

DATE | WORD, WORDCOUNTS | WORD2, WORDCOUNTS2 | ...

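What approach should I use here? Ultimately I want to draw a chart with the dates on the x-axis and the word counts (frequencies) on the y-axis.

Here is my best attempt so far.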

import csv
import MeCab

TestStartDate = "2013-11-11"
TestEndDate = "2014-06-10"

with open('Simplified.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        if str(row[0:1])[2:12] == TestStartDate:
            #str(row[1:2])[2:str(row[1:2]).find('"')-1] is the second column
            tagger = MeCab.Tagger()
            rose = tagger.parse(str(row[1:2])[2:str(row[1:2]).find('"')-1])
            #print rose
            wordCount = {}
            wordList = rose.split()[:-1:2]
            for word in wordList:
                wordCount.setdefault(word, 0)
                wordCount[word] += 1
            for word, count in wordCount.items():
                print('"%s, %i"' % (word, count))

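I plan to add the words and their counts to the data.

4 answers:

Answer 0: (score: 0)

This works for me. Do you really need the trailing '|'? When you split each line on '|' again later, for example before passing the data to matplotlib or something similar, the result will end with an empty string ''.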
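To make the point about the trailing '|' concrete, here is a minimal illustration. The two row strings are made up for the example, in the same shape as the output shown further down.

```python
# Hypothetical output rows, with and without the trailing '|'.
with_trailing = "2014-09-03 | 'blah',3|"
without_trailing = "2014-09-03 | 'blah',3"

# str.split keeps an empty field after a trailing separator.
fields_with = with_trailing.split('|')
fields_without = without_trailing.split('|')

print(fields_with)     # ['2014-09-03 ', " 'blah',3", '']
print(fields_without)  # ['2014-09-03 ', " 'blah',3"]
```

The extra '' would have to be filtered out before the fields are parsed into word and count pairs.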

The code below does not append a '|' to each result line. If you think it is necessary, just append a '|' in function d, like this:

return '%s| %s|'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

===========

def d(s):
    # Split "date | text" and count each distinct word in the text part.
    tokens = s.split('|')
    words = tokens[-1].strip().lower().split(' ')
    return '%s| %s'%(tokens[0],'|'.join(["'%s',%s"%(word,words.count(word)) for word in set(words)]))

def wordcount():
    lines=[
        '2014-09-01 | I love chicken',
        '2014-09-01 | I eat chicken',
        '2014-09-02 | She loves chicken',
        '2014-09-02 | Ha ha ha I love chicken',
        '2014-09-03 | Blah Blah Blah'
    ]
    # Concatenate the text of all rows that share a date.
    rows={}
    for line in lines:
        t_line = line.split(' | ')
        if t_line[0] not in rows:
            rows[t_line[0]]=''
        rows[t_line[0]]+=(' '+t_line[-1])
    # Format one output line per date.
    newrows=[]
    for k,v in rows.items():
        newrows.append(d('%s | %s'%(k,v)))
    print('\n'.join(newrows))

wordcount()


>>2014-09-02 | 'love',1|'i',1|'she',1|'loves',1|'chicken',2|'ha',3
>>2014-09-03 | 'blah',3
>>2014-09-01 | 'i',2|'chicken',2|'love',1|'eat',1

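Answer 1: (score: 0)

Read the input CSV and build a dictionary that maps each date to a Counter. For each row, update the counter for that row's date with the words in the row. Then write out rows of the form [date, (word1, count1), (word2, count2), ...]. This example sorts the dates and the words, but you can skip the sorting for better performance.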

from collections import Counter
import csv

data = {}

with open('my_data.csv') as f:
    for date, words in csv.reader(f, delimiter='|'):
        # Strip the padding around the '|' delimiter before using the date as a key.
        data.setdefault(date.strip(), Counter()).update(words.split())

with open('my_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')

    for date in sorted(data):
        # One row per date: the date followed by one "'word', count" cell per word.
        writer.writerow([date] + ["'{0}', {1}".format(word, count) for word, count in sorted(data[date].items())])
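The expected output in the question is lowercase ('i', 'ha'), so 'Ha' and 'ha' should be counted as the same word. Here is a minimal sketch of the same date-to-Counter idea with lowercasing added. The helper name `count_words_by_date` and the in-memory sample list are assumptions made for illustration, not part of the answer above.

```python
from collections import Counter

# Sample rows copied from the question.
SAMPLE_LINES = [
    "2014-09-01 | I love chicken",
    "2014-09-01 | I eat chicken",
    "2014-09-02 | She loves chicken",
    "2014-09-02 | Ha ha ha I love chicken",
    "2014-09-03 | Blah Blah Blah",
]


def count_words_by_date(lines):
    """Map each date to a Counter of its lowercased words (hypothetical helper)."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first '|' is the date, the rest is the text.
        date, _, text = line.partition("|")
        counts.setdefault(date.strip(), Counter()).update(text.lower().split())
    return counts


counts = count_words_by_date(SAMPLE_LINES)
print(counts["2014-09-02"]["ha"])  # 3
print(counts["2014-09-01"]["i"])   # 2
```

This sketch does not stem words, so 'love' and 'loves' are still counted separately. Merging them as in the question's expected output would need an extra normalization step.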

Answer 2: (score: 0)

I suggest using Counter to do the counting.

import re
from collections import Counter

stats = {}

with open('in.txt', 'r') as fin:
    for line in fin:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on runs of '|' and spaces so that no empty tokens are produced.
        tokens = re.split(r'[| ]+', line)
        key = tokens.pop(0)
        counter = Counter(tokens)
        if key in stats:
            stats[key] = stats[key] + counter
        else:
            stats[key] = counter

for key, counter in stats.items():
    print(key, '|', '|'.join(['"%s", %s' % (k, v) for k, v in counter.items()]), '|')
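The merge step above relies on Counter addition. A minimal sketch of how that behaves, using two lowercased rows from the question as stand-in data:

```python
from collections import Counter

first_row = Counter("I love chicken".lower().split())
second_row = Counter("I eat chicken".lower().split())

# Adding two Counters sums the counts key by key and returns a new Counter.
combined = first_row + second_row
print(combined["chicken"])  # 2
print(combined["eat"])      # 1

# Counter.update adds counts in place instead of building a new object per row.
running = Counter()
running.update(first_row)
running.update(second_row)
print(running == combined)  # True
```

Either form gives the same totals here. The in-place `update` is usually the simpler choice when accumulating over many rows.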

Answer 3: (score: 0)

Here is a solution that uses defaultdict and Counter from the collections module.

import csv
from collections import Counter, defaultdict

date_words = defaultdict(Counter)

with open('test.csv') as psvfile:
    reader = csv.reader(psvfile, delimiter="|")

    for line in reader:
        if not line:
            continue  # skip blank lines
        # Strip the padding around the '|' delimiter.
        date = line[0].strip()
        words = line[1].split()

        date_words[date].update(words)
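The code above builds the mapping but does not print it. Here is a minimal sketch of turning such a mapping into the DATE | WORD, WORDCOUNTS | ... rows from the question. The helper name `format_rows` is hypothetical, and the small `date_words` built here is a stand-in for the one read from the file.

```python
from collections import Counter, defaultdict


def format_rows(date_words):
    """Return one "DATE | 'word', n | ..." line per date (hypothetical helper)."""
    rows = []
    for date in sorted(date_words):
        # most_common() orders the words from most to least frequent.
        cells = ["'%s', %d" % (word, n) for word, n in date_words[date].most_common()]
        rows.append(" | ".join([date] + cells))
    return rows


# Small stand-in for the mapping built from the file.
date_words = defaultdict(Counter)
date_words["2014-09-03"].update("blah blah blah".split())
date_words["2014-09-01"].update("i love chicken i eat chicken".split())

rows = format_rows(date_words)
for row in rows:
    print(row)
```

Sorting the keys works as a chronological sort only because the dates are in ISO `YYYY-MM-DD` form.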

You may also want to consider the pandas library, which is good at handling dates and plotting.
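Whichever library ends up drawing the chart, it needs an x-series of dates and a y-series of counts per word. Here is a minimal standard-library sketch of that reshaping step. The helper name `to_series` is hypothetical, and the sketch assumes the date keys are ISO `YYYY-MM-DD` strings and the words are already lowercased.

```python
from collections import Counter
from datetime import datetime


def to_series(date_words, words):
    """Return (dates, {word: [count per date]}) for plotting (hypothetical helper)."""
    # Parse the date strings so the x-axis is ordered chronologically.
    dates = sorted(datetime.strptime(d, "%Y-%m-%d").date() for d in date_words)
    series = {
        # A Counter returns 0 for words that do not occur on a given date.
        word: [date_words[d.isoformat()][word] for d in dates]
        for word in words
    }
    return dates, series


# Lowercased counts for the sample rows in the question.
date_words = {
    "2014-09-01": Counter("i love chicken i eat chicken".split()),
    "2014-09-02": Counter("she loves chicken ha ha ha i love chicken".split()),
    "2014-09-03": Counter("blah blah blah".split()),
}

dates, series = to_series(date_words, ["chicken", "blah"])
print(series["chicken"])  # [2, 2, 0]
print(series["blah"])     # [0, 0, 3]
```

The `dates` list and each list in `series` have the same length, so they can be passed as x and y values to a plotting call.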