How to use Python to summarize flat tables

Time: 2016-05-09 19:30:19

Tags: python pandas dataframe group-by grouping

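I am reading a sales transaction table from Excel, and I would like to know, for each item at each location, how many sales were made within 1 hour of the first sale. I would also like to know how many of those were paid by card versus cash. Let A be the sales report; I want to create B.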

A=
item    Location    Time        Payment
X       Canada      10:03:18    CreditC
X       Canada      10:08:38    Cash
X       Canada      10:24:46    Cash
X       Canada      11:16:35    Cash
X       US          10:00:16    Cash
X       US          11:52:12    CreditC
Y       Canada      2:08:38     CreditC
Y       Canada      4:01:48     Cash
Y       US          13:32:02    CreditC
Y       US          14:07:03    Cash

B=
item    location    first sale  count   CreditCard  Cash
X       Canada      10:03:18    3       1           2
X       US          10:00:16    1       0           1
Y       Canada      2:08:38     1       1           0
Y       US          13:32:02    2       1           1

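I tried the following, which gives me errors on lines 6 and 9. I have written some workarounds that get the job done, but I would like to know the best way to do this.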

#group the transactions within the time interval
df['start'] = pd.to_datetime(df['Time'])
grouped = df.groupby(['item', 'Location', 'Time'])
df['end'] = (grouped['start'].transform(lambda grp: grp.min()+pd.Timedelta(minutes=interval)))
df['count'] = (df['start'] < df['end'])
df['CreditCard'] = (df.Payment.map(len) == 7 and df['start'] < df['end'])

Summary =  pd.DataFrame(grouped['count'].sum()).reset_index()
Summary['CreditCard']=pd.Sereis(grouped['CreditCard'].sum(), index=Summary.index)  

3 answers:

Answer 0 (score: 0)

You can use pd.crosstab to generate a frequency table:

import pandas as pd

# sample data from the question's table A
df = pd.DataFrame({
    'Location': ['Canada', 'Canada', 'Canada', 'Canada', 'US', 'US',
                 'Canada', 'Canada', 'US', 'US'],
    'Payment': ['CreditC', 'Cash', 'Cash', 'Cash', 'Cash', 'CreditC',
                'CreditC', 'Cash', 'CreditC', 'Cash'],
    'Time': ['10:03:18', '10:08:38', '10:24:46', '11:16:35', '10:00:16',
             '11:52:12', '2:08:38', '4:01:48', '13:32:02', '14:07:03'],
    'item': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y']})

df['start'] = pd.to_datetime(df['Time'])
grouped = df.groupby(['item', 'Location'])
interval = 60
df['end'] = (grouped['start'].transform(lambda grp: grp.min()+pd.Timedelta(minutes=interval)))

# isolate just the rows where the transaction occurs within an hour of first sale
df2 = df.loc[(df['start'] < df['end'])]
result = pd.crosstab(index=[df2['item'], df2['Location']], columns=[df2['Payment']])
result['count'] = result['Cash'] + result['CreditC']
result['first sale'] = grouped['Time'].first()

which yields

Payment        Cash  CreditC  count first sale
item Location                                 
X    Canada       2        1      3   10:03:18
     US           1        0      1   10:00:16
Y    Canada       0        1      1    2:08:38
     US           1        1      2   13:32:02
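
If the flat layout of B from the question is wanted rather than the indexed crosstab, the result can be reshaped with a rename and a reset_index. The following is only a sketch under a few assumptions: the names first, window, table and flat are made up here, the explicit format='%H:%M:%S' is a choice made for this example, and the CreditC column is renamed to CreditCard to match B.

```python
import pandas as pd

# Sample data copied from the question's table A.
df = pd.DataFrame({
    'item': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'Location': ['Canada', 'Canada', 'Canada', 'Canada', 'US', 'US',
                 'Canada', 'Canada', 'US', 'US'],
    'Time': ['10:03:18', '10:08:38', '10:24:46', '11:16:35', '10:00:16',
             '11:52:12', '2:08:38', '4:01:48', '13:32:02', '14:07:03'],
    'Payment': ['CreditC', 'Cash', 'Cash', 'Cash', 'Cash', 'CreditC',
                'CreditC', 'Cash', 'CreditC', 'Cash'],
})

# An explicit format avoids relying on inference for time-only strings.
df['start'] = pd.to_datetime(df['Time'], format='%H:%M:%S')

# Earliest sale per (item, Location), broadcast back onto every row.
first = df.groupby(['item', 'Location'])['start'].transform('min')
window = df[df['start'] < first + pd.Timedelta(minutes=60)]

# Frequency table of payment types inside the one-hour window.
table = pd.crosstab(index=[window['item'], window['Location']],
                    columns=window['Payment'])
table['count'] = table.sum(axis=1)
# Assumes the rows of each group are already in time order, as in A.
table['first sale'] = df.groupby(['item', 'Location'])['Time'].first()

# Flatten the index and reorder the columns to match B.
flat = table.rename(columns={'CreditC': 'CreditCard'}).reset_index()
flat.columns.name = None
flat = flat[['item', 'Location', 'first sale', 'count', 'CreditCard', 'Cash']]
print(flat)
```

The crosstab does the counting; everything after it only relabels and reorders, so the numbers are the same as in the indexed table.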

Answer 1 (score: 0)

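interval = 60  # minutes
# the window comparison below needs datetimes, not strings
df['Time'] = pd.to_datetime(df['Time'])
df.sort_values('Time', inplace=True)
# keep, per group, the payments made within `interval` minutes of the first sale
gb = df.groupby(['item', 'Location'], sort=False).apply(
    lambda group: group[group.Time <=
                        group.Time.iat[0] + pd.Timedelta(minutes=interval)].Payment)
gb = gb.reset_index().groupby(['item', 'Location']).Payment.value_counts()
# missing payment types become 0; cast back to int after fillna
gb = gb.unstack('Payment').fillna(0).astype(int)
gb['count'] = gb.sum(axis=1)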
>>> gb

Payment        Cash  CreditC  count
item Location                      
X    Canada       2        1      3
     US           1        0      1
Y    Canada       0        1      1
     US           1        1      2
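
The same table can probably also be built without apply, by masking the rows inside the window and then aggregating once. This is a sketch rather than a definitive version: it assumes pandas 0.25 or later for named aggregation, and the helper columns is_card and is_cash as well as the names keys, first, in_window and summary are made up for this example.

```python
import pandas as pd

# Sample data copied from the question's table A.
df = pd.DataFrame({
    'item': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'Location': ['Canada', 'Canada', 'Canada', 'Canada', 'US', 'US',
                 'Canada', 'Canada', 'US', 'US'],
    'Time': ['10:03:18', '10:08:38', '10:24:46', '11:16:35', '10:00:16',
             '11:52:12', '2:08:38', '4:01:48', '13:32:02', '14:07:03'],
    'Payment': ['CreditC', 'Cash', 'Cash', 'Cash', 'Cash', 'CreditC',
                'CreditC', 'Cash', 'CreditC', 'Cash'],
})
df['start'] = pd.to_datetime(df['Time'], format='%H:%M:%S')

keys = ['item', 'Location']
# Earliest sale per group, aligned with the original rows.
first = df.groupby(keys)['start'].transform('min')

# Keep only sales within 60 minutes of the group's first sale,
# sorted so that 'first' below picks the earliest time string.
in_window = df[df['start'] <= first + pd.Timedelta(minutes=60)]
in_window = in_window.sort_values('start').copy()

# Boolean helper columns; summing them counts each payment type.
in_window['is_card'] = in_window['Payment'].eq('CreditC')
in_window['is_cash'] = in_window['Payment'].eq('Cash')

summary = (in_window.groupby(keys)
           .agg(first_sale=('Time', 'first'),
                count=('Payment', 'count'),
                CreditCard=('is_card', 'sum'),
                Cash=('is_cash', 'sum'))
           .reset_index())
print(summary)
```

Because the payment types are counted through boolean columns, a type that never occurs in a group comes out as 0 directly and no fillna step is needed.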

Answer 2 (score: 0)

Solution

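import datetime as dt

import pandas as pd

# Time has to hold datetimes for the window arithmetic below
A['Time'] = pd.to_datetime(A['Time'])

def first_hour(x):
    # x holds the rows of one (item, Location) group, assumed to be in time order
    start = x.iloc[0]['Time']
    end = start + dt.timedelta(hours=1)
    # count the rows per payment type inside the window, one column per type
    df = x[(start <= x.Time) & (x.Time <= end)].groupby('Payment').count().T
    # sum across the payment columns; a plain df.sum() would not align with the rows
    df['count'] = df.sum(axis=1)
    df['first sale'] = start
    return df.iloc[[0]]

B = A.groupby(['item', 'Location']).apply(first_hour).fillna(0)

B = B.reset_index()[['item', 'Location', 'first sale', 'count', 'CreditC', 'Cash']]

  item Location          first sale  count  CreditC  Cash
0    X   Canada 2016-05-09 10:03:18      3      1.0   2.0
1    X       US 2016-05-09 10:00:16      1      0.0   1.0
2    Y   Canada 2016-05-09 02:08:38      1      1.0   0.0
3    Y       US 2016-05-09 13:32:02      2      1.0   1.0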
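
pandas aligns a Series on the row index when it is assigned as a column, so in a helper shaped like first_hour the direction of the sum decides whether the totals land on the rows at all. A minimal sketch on a toy frame; the frame and the names counts, misaligned and aligned are made up for illustration:

```python
import pandas as pd

# Toy frame shaped like the transposed count table inside first_hour:
# rows are original column names, columns are payment types.
counts = pd.DataFrame({'Cash': [2, 2], 'CreditC': [1, 1]},
                      index=['item', 'Time'])

col_totals = counts.sum()        # indexed by 'Cash', 'CreditC'
row_totals = counts.sum(axis=1)  # indexed by 'item', 'Time'

# Column totals share no labels with the row index, so every cell is NaN.
misaligned = counts.copy()
misaligned['count'] = col_totals

# Row totals carry the row labels, so they line up as intended.
aligned = counts.copy()
aligned['count'] = row_totals

print(misaligned)
print(aligned)
```

A count column that is NaN here would turn into 0.0 after a later fillna(0), which is how an all-zero count can appear without any error being raised.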