Question

I have an Excel workbook with the following data.

In Sheet 1:

duration date
10      5/20/2017 08:20 
23      5/20/2017 10:20
33      5/21/2017 12:20
56      5/22/2017 23:20

In Sheet 2:

duration date
34      5/20/2017 01:20 
12      5/20/2017 03:20
05      5/21/2017 11:20
44      5/22/2017 23:20

Expected output:

day[20] : [33, 46]
day[21] : [33, 5]
day[22] : [56, 44]

I am trying to get the sum of duration per day for every sheet, using the code below:

import pandas as pd

xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    # Prints the per-day sums of one sheet at a time
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())

How can I achieve this?

Answer 1

You can pass sheet_name=None to read_excel to get back a dictionary of DataFrames, keyed by sheet name:

dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1',    duration             date
0        10  5/20/2017 08:20
1        23  5/20/2017 10:20
2        33  5/21/2017 12:20
3        56  5/22/2017 23:20), ('Sheet2',    duration             date
0        34  5/20/2017 01:20
1        12  5/20/2017 03:20
2         5  5/21/2017 11:20
3        44  5/22/2017 23:20)])

Then aggregate each sheet in a dictionary comprehension:

dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20    46
2017-05-21     5
2017-05-22    44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20    33
2017-05-21    33
2017-05-22    56
Name: duration, dtype: int64}

Finally, concat the per-sheet results, group by the date level to build lists, and convert to a dictionary with to_dict:

d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
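To try these steps without an Excel file, here is a minimal sketch that builds the dictionary of DataFrames in memory from the data in the question. The helper name daily_sums_per_sheet is made up for illustration. It groups on dt.day instead of strftime so that the keys are day numbers like in the expected output; that is an assumption about what is wanted, and it would merge the same day number from different months.

```python
import pandas as pd

# Stand-in for pd.read_excel('reports.xlsx', sheet_name=None)
dfs = {
    'Sheet1': pd.DataFrame({
        'duration': [10, 23, 33, 56],
        'date': ['5/20/2017 08:20', '5/20/2017 10:20',
                 '5/21/2017 12:20', '5/22/2017 23:20'],
    }),
    'Sheet2': pd.DataFrame({
        'duration': [34, 12, 5, 44],
        'date': ['5/20/2017 01:20', '5/20/2017 03:20',
                 '5/21/2017 11:20', '5/22/2017 23:20'],
    }),
}


def daily_sums_per_sheet(frames):
    """Hypothetical helper: map each day of the month to a list of per-sheet sums."""
    per_sheet = {
        name: df.groupby(pd.to_datetime(df['date']).dt.day)['duration'].sum()
        for name, df in frames.items()
    }
    # concat on a dict builds a two-level index: (sheet name, day)
    combined = pd.concat(per_sheet)
    # Group on the day level and collect the sums in sheet order
    return combined.groupby(level=1).apply(list).to_dict()


result = daily_sums_per_sheet(dfs)
print(result)
```

The lists keep the order of the sheets in dfs, so the first element of each list comes from Sheet1 and the second from Sheet2.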

Answer 2

Create a function that takes a sheet's DataFrame and returns a dictionary:

def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

Then use merge_with from toolz or cytoolz:

from cytoolz.dicttoolz import merge_with

merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))

{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
 Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
 Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
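If installing toolz or cytoolz is not an option, the merging step can be approximated with the standard library. The sketch below is not the toolz implementation, just a plain-Python stand-in with a made-up name (merge_lists), and the input dictionaries use string keys in place of the Timestamp keys shown above.

```python
from collections import defaultdict


def merge_lists(dicts):
    """Hypothetical stand-in for merge_with(lambda x: sum(x, []), dicts)."""
    merged = defaultdict(list)
    for d in dicts:
        for key, values in d.items():
            # Append this sheet's values to whatever was collected so far
            merged[key].extend(values)
    return dict(merged)


# Per-sheet dictionaries shaped like the output of make_goofy_dict
sheet1_sums = {'2017-05-20': [33], '2017-05-21': [33], '2017-05-22': [56]}
sheet2_sums = {'2017-05-20': [46], '2017-05-21': [5], '2017-05-22': [44]}

merged = merge_lists([sheet1_sums, sheet2_sums])
print(merged)
```

A key that appears in only one of the dictionaries simply ends up with a shorter list.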

Setup

print(sheet1, sheet2, sep='\n\n')

   duration                date
0        10 2017-05-20 08:20:00
1        23 2017-05-20 10:20:00
2        33 2017-05-21 12:20:00
3        56 2017-05-22 23:20:00

   duration                date
0        34 2017-05-20 01:20:00
1        12 2017-05-20 03:20:00
2         5 2017-05-21 11:20:00
3        44 2017-05-22 23:20:00

For your problem, I would do this:

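from functools import partial

import pandas as pd
from cytoolz.dicttoolz import merge_with


def make_goofy_dict(d):
    # Sum durations per calendar day, then wrap each sum in a one-element list
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])


xls = pd.ExcelFile('reports.xlsx')

# read_sheet needs both xls and the sheet name, so bind xls before mapping
sheet_dict = merge_with(
    lambda x: sum(x, []),
    map(make_goofy_dict, map(partial(read_sheet, xls), xls.sheet_names))
)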
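One thing to keep in mind with resample('D'): unlike the groupby used in Answer 1, it produces an entry for every calendar day between the first and last timestamp, so a day with no rows shows up with a sum of 0. A small sketch with made-up data (the gap on 2017-05-21 is invented for illustration):

```python
import pandas as pd

# Made-up sheet with no rows on 2017-05-21
sheet = pd.DataFrame({
    'duration': [10, 23, 56],
    'date': pd.to_datetime(['2017-05-20 08:20',
                            '2017-05-20 10:20',
                            '2017-05-22 23:20']),
})

# resample fills in the missing day with an empty bin
resampled = sheet.set_index('date')['duration'].resample('D').sum()

# groupby only produces the days that actually occur in the data
grouped = sheet.groupby(sheet['date'].dt.normalize())['duration'].sum()

print(resampled)
print(grouped)
```

If the sheets do not cover the same days, the merged lists can therefore differ in length depending on which of the two approaches is used.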

Python - sum per day across multiple Excel sheets
