I have a CSV file (exported from Excel) that looks like this:
Receipt Name Address Date Time Total
25007 A ABC pte ltd 3/7/2016 10:40 12.30
25008 A ABC ptd ltd 3/7/2016 11.30 6.70
25009 B CCC ptd ltd 4/7/2016 07.35 23.40
25010 A ABC pte ltd 4/7/2016 12:40 9.90
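How can I retrieve the dates and times and group them separately for companies A and B, so that the output looks something like: (A, 3/7/2016, 10:40, 11:30, 4/7/2016, 12:40), (B, 4/7/2016, 07:35)?
My current code is: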
import re

datePattern = re.compile(r"(\d+/\d+/\d+)\s+(\d+:\d+)")
dateDict = dict()
for i, line in enumerate(open('sample_data.csv')):
    for match in re.finditer(datePattern, line):
        # group(1) is the date, group(2) is the time
        if match.group(1) in dateDict:
            dateDict[match.group(1)].append(match.group(2))
        else:
            dateDict[match.group(1)] = [match.group(2),]
However, this only groups times by date, and I now want the name to be part of the grouping as well. *Using the csv module is preferred.
Answer 0 (score: 0)
Assuming your data actually looks like this:
Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,4/7/2016,10:40,"Cheese, Cookie, Pie"
25008,A,CCC pte ltd,4/7/2016,11:30,"Cheese, Cookie"
25009,B,CCC pte ltd,4/7/2016,07:35,"Chocolate"
25010,A,CCC pte ltd,4/7/2016,12:40," Butter, Cookie"
then the grouping is very simple:
from collections import defaultdict
from csv import reader

with open("test.csv") as f:
    next(f)  # skip header
    group_dict = defaultdict(list)
    # each row unpacks to: receipt, name, address, date, time, items
    for _, name, _, dte, time, _ in reader(f):
        group_dict[name].append((dte, time))

from pprint import pprint as pp
pp(dict(group_dict))
which will give you:
{'A': [('4/7/2016', '10:40'), ('4/7/2016', '11:30'), ('4/7/2016', '12:40')],
'B': [('4/7/2016', '07:35')]}
If you don't want the date repeated, you can group by it as well:
with open("test.csv") as f:
    next(f)  # skip header
    group_dict = defaultdict(list)
    for _, name, _, dte, time, _ in reader(f):
        # key on the (name, date) pair so each date appears only once
        group_dict[name, dte].append(time)

from pprint import pprint as pp
pp(dict(group_dict))
which will give you:
{('A', '4/7/2016'): ['10:40', '11:30', '12:40'], ('B', '4/7/2016'): ['07:35']}
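To get closer to the flat tuples shown in the question, the same grouping can be nested (name first, then date) and flattened afterwards. The following is only a sketch: the helper name `group_flat` and the inline `SAMPLE` text are made up for illustration, and the sample uses the question's dates in the comma-separated layout assumed by this answer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample: the question's rows in the comma-separated
# layout assumed in this answer.
SAMPLE = """\
Receipt,Name,Address,Date,Time,Items
25007,A,ABC pte ltd,3/7/2016,10:40,"Cheese, Cookie, Pie"
25008,A,ABC pte ltd,3/7/2016,11:30,"Cheese, Cookie"
25009,B,CCC pte ltd,4/7/2016,07:35,"Chocolate"
25010,A,ABC pte ltd,4/7/2016,12:40,"Butter, Cookie"
"""

def group_flat(fileobj):
    """Return one tuple per name: (name, date, time..., date, time...)."""
    grouped = defaultdict(dict)  # name -> {date: [times]}
    for row in csv.DictReader(fileobj):
        grouped[row["Name"]].setdefault(row["Date"], []).append(row["Time"])

    result = []
    for name, dates in grouped.items():
        flat = [name]
        for date, times in dates.items():
            flat.append(date)   # each date appears once...
            flat.extend(times)  # ...followed by all of its times
        result.append(tuple(flat))
    return result

rows = group_flat(io.StringIO(SAMPLE))
print(rows)
```

Because dictionaries keep insertion order in Python 3.7+, names and dates come out in the order they first appear in the file. With a real file, `group_flat(open("test.csv", newline=""))` should work the same way.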
Answer 1 (score: -1)
This can be done easily with the Pandas module:
import pandas as pd
df = pd.read_csv('/path/to/file.csv')
df.groupby(['Name','Date']).Time.apply(list).reset_index().to_csv('d:/temp/out.csv', index=False)
d:/temp/out.csv:
Name,Date,Time
A,3/7/2016,"['10:40', '11.30']"
A,4/7/2016,['12:40']
B,4/7/2016,['07.35']
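Answer 2 (score: -1)
If you don't want to use Pandas, here is one possible solution. It isn't the most elegant, because your CSV format is relatively clunky to parse. If you can change the format to use a non-whitespace field delimiter, you would be better off using a proper CSV parsing library (such as pandas or Python's built-in csv module).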
import re

# date, whitespace, then a time whose separator may be ":" or "."
datePattern = re.compile(r"(\d+/\d+/\d+)\s+(\d+[:.]\d+)")
# receipt number followed by the company name; leading whitespace is optional
companyPattern = re.compile(r"^\s*\d+\s+(\w+)")
companyDict = {}
with open('sample_data.csv') as f:
    for i, line in enumerate(f):
        # skip csv header
        if i == 0:
            continue
        timestampMatch = datePattern.search(line)
        companyMatch = companyPattern.search(line)
        # filter out any malformed lines which don't match
        if timestampMatch is None or companyMatch is None:
            continue
        date = timestampMatch.group(1)
        time = timestampMatch.group(2)
        company = companyMatch.group(1)
        companyDict.setdefault(company, []).append("{} {}".format(date, time))
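Note that the time field is inconsistent about whether the hour/minute separator is . or :, so I accounted for that.
Running this on your sample data yields the following value for companyDict: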
{'A': ['3/7/2016 10:40', '3/7/2016 11.30', '4/7/2016 12:40'], 'B': ['4/7/2016 07.35']}
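If regular expressions feel heavy for this, a split-based variant is also possible. The sketch below assumes every data line has the shape "receipt name address... date time total", with only the address containing spaces; `parse_line`, `group_by_company` and `SAMPLE_LINES` are hypothetical names. It also normalizes the `.` separator in times to `:`, which the code above does not do.

```python
from collections import defaultdict

# The question's sample data, one string per line (header first).
SAMPLE_LINES = [
    "Receipt Name Address Date Time Total",
    "25007 A ABC pte ltd 3/7/2016 10:40 12.30",
    "25008 A ABC ptd ltd 3/7/2016 11.30 6.70",
    "25009 B CCC ptd ltd 4/7/2016 07.35 23.40",
    "25010 A ABC pte ltd 4/7/2016 12:40 9.90",
]

def parse_line(line):
    """Split a data line into (name, date, time) with the time normalized."""
    # Take the first two fields from the left and the last three from the
    # right; whatever remains in the middle is the address.
    receipt, name, rest = line.split(None, 2)
    address, date, time, total = rest.rsplit(None, 3)
    return name, date, time.replace(".", ":")

def group_by_company(lines):
    grouped = defaultdict(list)
    for line in lines[1:]:  # skip the header line
        # A well-formed line needs at least six whitespace-separated fields.
        if len(line.split()) < 6:
            continue
        name, date, time = parse_line(line)
        grouped[name].append("{} {}".format(date, time))
    return dict(grouped)

company_times = group_by_company(SAMPLE_LINES)
print(company_times)
```

Splitting from both ends avoids guessing how many words the address has, but it breaks if any other field ever contains whitespace, so the regex approach may still be safer for messier input.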