I have an Excel workbook with the following data.
In Sheet 1:
duration date
10 5/20/2017 08:20
23 5/20/2017 10:20
33 5/21/2017 12:20
56 5/22/2017 23:20
In Sheet 2:
duration date
34 5/20/2017 01:20
12 5/20/2017 03:20
05 5/21/2017 11:20
44 5/22/2017 23:20
Expected output:
day[20] : [33, 46]
day[21] : [33, 5]
day[22] : [56, 44]
I am trying to get the per-day sum of durations across all sheets, using the code below:
import pandas as pd
xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    # Prints the per-day sums for this sheet only; the sheets are never combined.
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())
How can I achieve this?
Answer 0 (score: 2)
You can pass the parameter sheet_name=None to read_excel, which returns a dictionary of DataFrames keyed by sheet name:
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1', duration date
0 10 5/20/2017 08:20
1 23 5/20/2017 10:20
2 33 5/21/2017 12:20
3 56 5/22/2017 23:20), ('Sheet2', duration date
0 34 5/20/2017 01:20
1 12 5/20/2017 03:20
2 5 5/21/2017 11:20
3 44 5/22/2017 23:20)])
Then aggregate each sheet in a dictionary comprehension:
dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20 46
2017-05-21 5
2017-05-22 44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20 33
2017-05-21 33
2017-05-22 56
Name: duration, dtype: int64}
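Finally, concat the per-sheet results, group by the second index level (the date) to create lists, and convert the result to a dictionary with to_dict: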
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
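The three steps above can be tried without an Excel file. The following is a minimal sketch, assuming in-memory DataFrames built from the question's sample data stand in for the dictionary that read_excel returns with sheet_name=None. The helper name daily_totals is hypothetical and not part of the original answer.

```python
import pandas as pd

# Stand-in for pd.read_excel('reports.xlsx', sheet_name=None):
# a dict mapping sheet name -> DataFrame (sample data from the question).
dfs = {
    'Sheet1': pd.DataFrame({
        'duration': [10, 23, 33, 56],
        'date': ['5/20/2017 08:20', '5/20/2017 10:20',
                 '5/21/2017 12:20', '5/22/2017 23:20'],
    }),
    'Sheet2': pd.DataFrame({
        'duration': [34, 12, 5, 44],
        'date': ['5/20/2017 01:20', '5/20/2017 03:20',
                 '5/21/2017 11:20', '5/22/2017 23:20'],
    }),
}


def daily_totals(df):
    """Sum `duration` per calendar day (hypothetical helper name)."""
    day = pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d')
    return df.groupby(day)['duration'].sum()


# One Series of daily sums per sheet.
per_sheet = {name: daily_totals(df) for name, df in dfs.items()}

# concat builds a two-level index (sheet name, day); grouping on level 1
# collects the per-sheet sums for each day into a list.
result = pd.concat(per_sheet).groupby(level=1).apply(list).to_dict()
print(result)
```

Within each list the values follow the order of the sheets in the dictionary, so Sheet1's sum comes first.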
Answer 1 (score: 1)
Create a function that takes a sheet's DataFrame and returns a dictionary:
def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
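One property of resample('D') worth keeping in mind: a day with no rows still appears in the result, with a sum of 0. The sketch below illustrates this, assuming a sheet with no entries on 5/21. The data is made up for illustration, and the keys are formatted as strings only to keep the printed output short.

```python
import pandas as pd

# Illustrative data (not from the question): nothing recorded on 2017-05-21.
sheet = pd.DataFrame({
    'duration': [10, 23, 56],
    'date': pd.to_datetime(['2017-05-20 08:20', '2017-05-20 10:20',
                            '2017-05-22 23:20']),
})

# Daily bins between the first and last timestamp; empty bins sum to 0.
daily = sheet.set_index('date')['duration'].resample('D').sum()

# Wrap each daily sum in a one-element list, as make_goofy_dict does.
as_lists = {ts.strftime('%Y-%m-%d'): [int(total)] for ts, total in daily.items()}
print(as_lists)
```

If gaps should be omitted instead of reported as 0, grouping by the formatted date, as in the other answer, is one alternative.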
Then use merge_with from toolz or cytoolz:
from cytoolz.dicttoolz import merge_with
merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))
{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
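For readers without toolz or cytoolz installed, the merging step can be approximated with the standard library. This is a sketch of the behaviour used here, not the toolz implementation: merge_lists is a hypothetical name, and plain date strings stand in for the Timestamp keys.

```python
from collections import defaultdict


def merge_lists(dicts):
    """Concatenate list values that share a key (hypothetical helper)."""
    merged = defaultdict(list)
    for d in dicts:
        for key, values in d.items():
            merged[key].extend(values)
    return dict(merged)


# Per-sheet dictionaries in the shape make_goofy_dict returns:
# each key maps to a one-element list holding that day's sum.
sheet1_totals = {'2017-05-20': [33], '2017-05-21': [33], '2017-05-22': [56]}
sheet2_totals = {'2017-05-20': [46], '2017-05-21': [5], '2017-05-22': [44]}

combined = merge_lists([sheet1_totals, sheet2_totals])
print(combined)
```

This covers only the list-concatenation case, which is what lambda x: sum(x, []) does in the merge_with call above.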
Details
print(sheet1, sheet2, sep='\n\n')
duration date
0 10 2017-05-20 08:20:00
1 23 2017-05-20 10:20:00
2 33 2017-05-21 12:20:00
3 56 2017-05-22 23:20:00
duration date
0 34 2017-05-20 01:20:00
1 12 2017-05-20 03:20:00
2 5 2017-05-21 11:20:00
3 44 2017-05-22 23:20:00
For your problem, I would do this:
import pandas as pd
from cytoolz.dicttoolz import merge_with
def make_goofy_dict(d):
    # Sum durations per calendar day, then wrap each sum in a one-element list.
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])
xls = pd.ExcelFile('reports.xlsx')
sheet_dict = merge_with(
    lambda x: sum(x, []),
    # read_sheet needs the workbook as well as the sheet name.
    map(make_goofy_dict, (read_sheet(xls, sn) for sn in xls.sheet_names))
)