Grouping related data in a CSV/Excel file

Date: 2016-10-10 11:47:50

Tags: python regex grouping

This is a CSV/Excel file:

   Receipt Name    Address      Date       Time    Total
    25007   A      ABC pte ltd   3/7/2016   10:40   12.30
    25008   A      ABC ptd ltd   3/7/2016   11.30   6.70
    25009   B      CCC ptd ltd   4/7/2016   07.35   23.40
    25010   A      ABC pte ltd   4/7/2016   12:40   9.90

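How can I retrieve the dates and times and group them separately for companies A and B, so that the output looks something like: (A, 3/7/2016, 10:40, 11:30, 4/7/2016, 12:40), (B, 4/7/2016, 07:35)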

My existing code is:

import re

datePattern = re.compile(r"(\d+/\d+/\d+)\s+(\d+:\d+)")
dateDict = dict()

for i, line in enumerate(open('sample_data.csv')):
    for match in re.finditer(datePattern,line):
        if match.group(1) in dateDict:
            dateDict[match.group(1)].append(match.group(2))
        else:
            dateDict[match.group(1)] = [match.group(2),]

However, it only groups by date and time, and now I want the name to be part of the grouping as well. *Using the csv module is preferred.

3 answers:

Answer 0: (score: 0)

Assuming your data actually looks like this:

Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"
25008,A,CCC pte ltd,4/7/2016,11:30,"Cheese, Cookie"
25009,B,CCC pte ltd,4/7/2016,07:35,"Chocolate"
25010,A,CCC pte ltd,4/7/2016,12:40," Butter, Cookie"

then grouping is quite simple:

from collections import defaultdict
from csv import reader
with open("test.csv") as f:
    next(f) # skip header
    group_dict = defaultdict(list)
    for _, name, _, dte, time, _ in reader(f):
        group_dict[name].append((dte, time))

from pprint import pprint as pp

pp(dict(group_dict))

which gives you:

{'A': [('4/7/2016', '10:40'), ('4/7/2016', '11:30'), ('4/7/2016', '12:40')],
 'B': [('4/7/2016', '07:35')]}

If you don't want the dates repeated, you can group by date as well:

with open("test.csv") as f:
    next(f) # skip header
    group_dict = defaultdict(list)
    for _, name, _, dte, time, _ in reader(f):
        group_dict[name, dte].append(time)

from pprint import pprint as pp

pp(dict(group_dict))

which gives you:

{('A', '4/7/2016'): ['10:40', '11:30', '12:40'], ('B', '4/7/2016'): ['07:35']}
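
Neither dictionary above is quite the flat form asked for in the question, e.g. (A, 3/7/2016, 10:40, 11:30, 4/7/2016, 12:40). The following is a minimal sketch of one way to get there with the csv module. It assumes the file really is comma-separated and that every time uses ":" as the separator; the in-memory SAMPLE data and the helper names group_by_name and flatten are made up for illustration and are not part of the original answer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical comma-separated version of the question's data
# (times and addresses normalized for the sake of the example).
SAMPLE = """Receipt,Name,Address,Date,Time,Total
25007,A,ABC pte ltd,3/7/2016,10:40,12.30
25008,A,ABC pte ltd,3/7/2016,11:30,6.70
25009,B,CCC pte ltd,4/7/2016,07:35,23.40
25010,A,ABC pte ltd,4/7/2016,12:40,9.90
"""


def group_by_name(csv_file):
    """Return {name: {date: [times]}} in first-seen order."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        grouped[row["Name"]].setdefault(row["Date"], []).append(row["Time"])
    return dict(grouped)


def flatten(grouped):
    """Turn {name: {date: [times]}} into tuples (name, date, time, ..., date, time, ...)."""
    result = []
    for name, dates in grouped.items():
        flat = [name]
        for date, times in dates.items():
            flat.append(date)
            flat.extend(times)
        result.append(tuple(flat))
    return result


rows = flatten(group_by_name(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

To read from a real file, pass an open file object instead of io.StringIO(SAMPLE). The ordering relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward.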

Answer 1: (score: -1)

This is easy to do with the Pandas module:

import pandas as pd

df = pd.read_csv('/path/to/file.csv')

df.groupby(['Name','Date']).Time.apply(list).reset_index().to_csv('d:/temp/out.csv', index=False)

d:/temp/out.csv:

Name,Date,Time
A,3/7/2016,"['10:40', '11.30']"
A,4/7/2016,['12:40']
B,4/7/2016,['07.35']

Answer 2: (score: -1)

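If you don't want to use Pandas, here is one possible solution. It isn't the most elegant, because your CSV format is relatively clunky to parse. If you can change the format to use a non-whitespace field delimiter, you would be better off using a proper CSV parsing library (such as pandas or Python's built-in csv module).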

import re

datePattern = re.compile(r"(\d+/\d+/\d+)\s+(\d+[:.]\d+)")
companyPattern = re.compile(r"^\s+\d+\s+(\w+)")
companyDict = {}

for i, line in enumerate(open('sample_data.csv')):
    # skip csv header
    if i == 0:
        continue

    timestampMatch = datePattern.search(line)
    companyMatch   = companyPattern.search(line)

    # filter out any malformed lines which don't match
    if timestampMatch is None or companyMatch is None:
        continue

    date = timestampMatch.group(1)
    time = timestampMatch.group(2)
    company = companyMatch.group(1)

    companyDict.setdefault(company, []).append("{} {}".format(date, time))

Note that the time field is inconsistent about whether it uses . or : as the hour/minute separator, so I accounted for that.

Running the regex-based loop on your sample data yields the following value for companyDict:

{'A': ['3/7/2016 10:40', '3/7/2016 11.30', '4/7/2016 12:40'], 'B': ['4/7/2016 07.35']}
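
As a possible variation, the whitespace-aligned lines can be split into columns rather than matched field by field, and the "." in the times rewritten to ":" along the way. This is only a sketch under two assumptions that the question does not confirm: columns are always separated by at least two spaces, and no field contains two consecutive spaces. The function name group_times and the SAMPLE_LINES list are hypothetical.

```python
import re
from collections import defaultdict

# Lines copied from the question; columns are separated by runs of spaces.
SAMPLE_LINES = [
    "   Receipt Name    Address      Date       Time    Total",
    "    25007   A      ABC pte ltd   3/7/2016   10:40   12.30",
    "    25008   A      ABC ptd ltd   3/7/2016   11.30   6.70",
    "    25009   B      CCC ptd ltd   4/7/2016   07.35   23.40",
    "    25010   A      ABC pte ltd   4/7/2016   12:40   9.90",
]


def group_times(lines):
    """Group times by (name, date), normalizing '.' to ':' in each time."""
    grouped = defaultdict(list)
    rows = iter(lines)
    next(rows, None)  # skip the header line
    for line in rows:
        # Assumption: two or more spaces always mark a column boundary.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 6:
            continue  # ignore lines that do not have the expected six columns
        _receipt, name, _address, date, time, _total = fields
        grouped[name, date].append(time.replace(".", ":"))
    return dict(grouped)


result = group_times(SAMPLE_LINES)
print(result)
```

For a file on disk, an open file object can be passed in place of SAMPLE_LINES, since the function only iterates over its argument line by line.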