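I wrote the following Python code to parse .csv files and print two columns, the date and the rating. Now I want to count the ratings per date. For example, if 2018-4-01 appears 4 times with the ratings 1, 4, 1, 4, I want to print: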
2018-4-01 1 2
2018-4-01 4 2
The code I tried:
import glob
import csv
import re
from collections import Counter

path = "ReviewsSep2018/*.csv"
mylist = []
for filename in glob.glob(path):
    print(filename)
    with open(filename, newline='', encoding='utf-16') as f:
        reader = csv.reader(f)
        for row in reader:
            # Pull a date such as 2018-09-01 out of column 5
            result = re.search(r'\d+\W\d+\W\d+', row[5])
            if result:
                line = result.group()
                # Store (date, rating); the rating is in column 9
                mylist.append((line, row[9]))
print(mylist)
for i in mylist:
    print(i[0], i[1])
Output of the code sample:
2018-09-01 1
2018-09-01 5
2018-09-01 2
2018-09-01 1
2018-08-23 1
2018-09-01 4
2018-09-01 4
2018-09-01 5
2018-09-01 2
2018-09-02 1
2018-09-02 5
2018-09-02 5
Desired result:
date star count
2018-09-01 1 2
2018-09-01 2 3
2018-09-01 5 2
2018-09-02 5 2
2018-08-23 1 1
Answer 0 (score: 0)
Just make mylist a Counter:
mycount = Counter()
Then, instead of appending to a list of (date, rating) tuples, increment the count for each tuple:
mycount[(line, row[9])] += 1
Finally, display it like this:
for (date, rating), count in mycount.items():
    print(date, rating, count)
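To make the idea above concrete, here is a minimal self-contained sketch. It assumes the (date, rating) pairs have already been extracted; the rows list simply copies the output printed in the question, and the helper name count_ratings is hypothetical. The sorted() call is an addition for a stable print order and is not required by the approach.

```python
from collections import Counter

# (date, rating) pairs copied from the output shown in the question.
rows = [
    ("2018-09-01", "1"), ("2018-09-01", "5"), ("2018-09-01", "2"),
    ("2018-09-01", "1"), ("2018-08-23", "1"), ("2018-09-01", "4"),
    ("2018-09-01", "4"), ("2018-09-01", "5"), ("2018-09-01", "2"),
    ("2018-09-02", "1"), ("2018-09-02", "5"), ("2018-09-02", "5"),
]


def count_ratings(pairs):
    """Count how often each (date, rating) pair occurs (hypothetical helper)."""
    counts = Counter()
    for date, rating in pairs:
        counts[(date, rating)] += 1
    return counts


counts = count_ratings(rows)

print("date star count")
# Sorting the items gives a deterministic order by date, then rating.
for (date, rating), count in sorted(counts.items()):
    print(date, rating, count)
```

Because tuples are hashable, each (date, rating) pair can serve directly as a Counter key, so no nested dictionary is needed.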
Answer 1 (score: 0)
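If you don't mind using the pandas library, you can use groupby once the data has been parsed. In my opinion, pandas also has good .csv reading functionality.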
import pandas as pd

(pd.DataFrame([['2018-09-01', 1],
               ['2018-09-01', 5],
               ['2018-09-01', 2],
               ['2018-09-01', 1],
               ['2018-08-23', 1],
               ['2018-09-01', 4],
               ['2018-09-01', 4],
               ['2018-09-01', 5],
               ['2018-09-01', 2],
               ['2018-09-02', 1],
               ['2018-09-02', 5],
               ['2018-09-02', 5]],
              columns=['date', 'star']
              )
 .assign(count=1)
 .groupby(['date', 'star'])
 .count()
)
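As a rough sketch of the CSV-reading route, the snippet below reads the data with pd.read_csv and counts rows per group with size(). It assumes a clean two-column file with a date,star header, which is not the layout of the asker's real files (those are UTF-16 and keep the date inside a longer field), so csv_text is a hypothetical stand-in.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real review files.
csv_text = """date,star
2018-09-01,1
2018-09-01,5
2018-09-01,2
2018-09-01,1
2018-08-23,1
2018-09-01,4
2018-09-01,4
2018-09-01,5
2018-09-01,2
2018-09-02,1
2018-09-02,5
2018-09-02,5
"""

# read_csv accepts any file-like object, so StringIO replaces a real path.
df = pd.read_csv(io.StringIO(csv_text))

# size() counts the rows in each (date, star) group;
# reset_index turns the result back into a flat table with a "count" column.
result = df.groupby(["date", "star"]).size().reset_index(name="count")

print(result.to_string(index=False))
```

Using size() avoids the helper column added by assign(count=1); both approaches should give the same counts for this data.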