Counting occurrences of an item in a list, given another item

Time: 2018-09-05 15:22:12

Tags: python

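I wrote the following Python code to parse .csv files and print two columns, date and rating. Now I want to count the ratings per date. For example, if 2018-4-01 appears 4 times with the ratings 1, 4, 1, 4, I want to print: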

2018-4-01 1 2
2018-4-01 4 2

The code I tried

import glob
import csv
import re
from collections import Counter
path = "ReviewsSep2018/*.csv"
mylist = []
for filename in glob.glob(path):
    print(filename)
    with open(filename, newline='', encoding='utf-16') as f:
        reader = csv.reader(f)
        for row in reader:
            # Column 5 contains the date; extract the date-like part
            result = re.search(r'\d+\W\d+\W\d+', row[5])
            if result:
                line = result.group()
                # Column 9 contains the rating
                mylist.append((line, row[9]))
        print(mylist)
for i in mylist:
    print(i[0], i[1])

Output of the code sample

2018-09-01 1
2018-09-01 5
2018-09-01 2
2018-09-01 1
2018-08-23 1
2018-09-01 4
2018-09-01 4
2018-09-01 5
2018-09-01 2
2018-09-02 1
2018-09-02 5
2018-09-02 5

Desired result

date       star   count
2018-09-01   1        2
2018-09-01   2        3
2018-09-01   5        2
2018-09-02   5        2
2018-08-23   1        1

2 answers:

Answer 0 (score: 0)

Just make mylist a Counter instead of a list:

mycount = Counter()

Then, instead of appending each (date, rating) tuple to a list, increment its count:

mycount[(line,row[9])] += 1

Finally, display it like this:

for (date, rating), count in mycount.items():
    print(date, rating, count)
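
Put together, the approach might look like the minimal sketch below. It skips the CSV parsing and instead uses the (date, rating) pairs from the question's sample output; the names pairs and count_ratings are made up for illustration. Ratings are kept as strings because csv.reader yields strings.

```python
from collections import Counter

# (date, rating) pairs copied from the sample output in the question.
pairs = [
    ("2018-09-01", "1"), ("2018-09-01", "5"), ("2018-09-01", "2"),
    ("2018-09-01", "1"), ("2018-08-23", "1"), ("2018-09-01", "4"),
    ("2018-09-01", "4"), ("2018-09-01", "5"), ("2018-09-01", "2"),
    ("2018-09-02", "1"), ("2018-09-02", "5"), ("2018-09-02", "5"),
]


def count_ratings(pairs):
    """Count how many times each (date, rating) pair occurs."""
    counts = Counter()
    for date, rating in pairs:
        counts[(date, rating)] += 1
    return counts


counts = count_ratings(pairs)

# Sorting the items groups the rows by date, then by rating.
print("date", "star", "count")
for (date, rating), count in sorted(counts.items()):
    print(date, rating, count)
```

Because tuples are hashable, the (date, rating) pair can be used directly as the Counter key, so no nested dictionaries are needed.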

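Answer 1 (score: 0)

If you don't mind using the pandas library, you can use groupby once the data has been parsed. In my opinion, pandas also has good .csv reading functionality.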

import pandas as pd

(pd.DataFrame([['2018-09-01', 1],
              ['2018-09-01', 5],
              ['2018-09-01', 2],
              ['2018-09-01', 1],
              ['2018-08-23', 1],
              ['2018-09-01', 4],
              ['2018-09-01', 4],
              ['2018-09-01', 5],
              ['2018-09-01', 2],
              ['2018-09-02', 1],
              ['2018-09-02', 5],
              ['2018-09-02', 5]],
             columns=['date', 'star']
            )
 .assign(count=1)
 .groupby(['date', 'star'])
 .count()
)
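
As a possible variation (not part of the original answer), groupby followed by size() and reset_index() gives a flat table with date, star, and count columns, closer to the layout the question asks for. This is only a sketch on a shortened sample; the names rows and summary are illustrative.

```python
import pandas as pd

# Shortened sample of (date, star) rows, for illustration only.
rows = [
    ("2018-09-01", 1), ("2018-09-01", 5), ("2018-09-01", 1),
    ("2018-08-23", 1), ("2018-09-02", 5), ("2018-09-02", 5),
]
df = pd.DataFrame(rows, columns=["date", "star"])

# size() counts the rows in each (date, star) group; reset_index()
# turns the group keys back into ordinary columns and names the
# resulting count column.
summary = (
    df.groupby(["date", "star"])
      .size()
      .reset_index(name="count")
)

print(summary.to_string(index=False))
```

With size() there is no need to add a helper column of ones before grouping.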