Using groupby

Time: 2015-08-28 22:09:22

Tags: pandas

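I don't know why I can't figure this out. I am trying to find the number of rows that fall within specific time frames, based on the Customer_ID column. The time frames I am interested in are the 14 days, 7 days, and 3 days before Call_Time.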

df =

Call_Time   Customer_ID  Survey
    8/26/2015   aaa123    1
    8/27/2015   bbb222    1

dataframe fcr =

Call_Time   Customer_ID
8/14/2015   aaa123
8/7/2015    aaa123
7/15/2015   aaa123
8/22/2015   aaa123
8/3/2015    bbb222
8/8/2015    bbb222
8/10/2015   bbb222

Here is the code I am using now:

fcr['Total_Hits'] = 1
g14 = fcr.groupby([pd.Grouper(freq='14D',key='Call_Time'),'Customer_ID']).sum()
g7 = fcr.groupby([pd.Grouper(freq='7D',key='Call_Time'),'Customer_ID']).sum()
g3 = fcr.groupby([pd.Grouper(freq='3D',key='Call_Time'),'Customer_ID']).sum()

Then I want to join these values to another dataframe, which comes from a separate file.

temp = pd.merge(g14, g7, how ='left', on = ['Call_Time', 'Customer_ID'])
previous_hits = pd.merge(temp, g3, how ='left', on = ['Call_Time', 'Customer_ID'])
df2 = pd.merge(df, previous_hits, how ='left', on = ['Call_Time', 'Customer_ID'])

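So my df2 would combine all the call records (fcr) with the original df (the survey results). What I want to know is: for each customer who filled out a survey, how many times did they call in the 14, 7, or 3 days before filling it out? Do customers who called multiple times give lower scores?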

1 answer:

Answer 0 (score: 1)

You can use a custom Grouper (see the Grouper doc example).

First I have to use the function pd.to_datetime, because my column Call_Time is not of datetime64 dtype. Then I add a column Count with the scalar value 1, and group and sum by the suggested frequencies and the column Customer_ID.

import io
import pandas as pd

temp = u"""Call_Time,Customer_ID
8/14/2015 0:00,aaa123
8/7/2015 0:00,aaa123
7/15/2015 0:00,aaa123
8/22/2015 0:00,aaa123
8/3/2015 0:00,bbb222
8/8/2015 0:00,bbb222
8/10/2015 0:00,bbb222"""

df = pd.read_csv(io.StringIO(temp), parse_dates=True)
# time format - http://strftime.org/
df['Call_Time'] = pd.to_datetime(df['Call_Time'], format='%m/%d/%Y %H:%M')
# add a Count column: each row is one call by the customer
df["Count"] = 1
print(df)
#    Call_Time Customer_ID  Count
# 0 2015-08-14      aaa123      1
# 1 2015-08-07      aaa123      1
# 2 2015-07-15      aaa123      1
# 3 2015-08-22      aaa123      1
# 4 2015-08-03      bbb222      1
# 5 2015-08-08      bbb222      1
# 6 2015-08-10      bbb222      1

# grouping by frequency and Customer_ID
g14 = df.groupby([pd.Grouper(freq='14D', key='Call_Time'), 'Customer_ID']).sum()
g7 = df.groupby([pd.Grouper(freq='7D', key='Call_Time'), 'Customer_ID']).sum()
g3 = df.groupby([pd.Grouper(freq='3D', key='Call_Time'), 'Customer_ID']).sum()
print(g14)
print(g7)
print(g3)
#                         Count
# Call_Time  Customer_ID
# 2015-07-15 aaa123           1
# 2015-07-29 aaa123           1
#            bbb222           3
# 2015-08-12 aaa123           2
#                         Count
# Call_Time  Customer_ID
# 2015-07-15 aaa123           1
# 2015-07-29 bbb222           1
# 2015-08-05 aaa123           1
#            bbb222           2
# 2015-08-12 aaa123           1
# 2015-08-19 aaa123           1
#                         Count
# Call_Time  Customer_ID
# 2015-07-15 aaa123           1
# 2015-08-02 bbb222           1
# 2015-08-05 aaa123           1
# 2015-08-08 bbb222           2
# 2015-08-14 aaa123           1
# 2015-08-20 aaa123           1

Grouper sorts the date column itself; you can sort by Call_Time within each Customer_ID just to check the data:

# DataFrame.sort() was removed in later pandas versions; sort_values() replaces it
df = df.sort_values(['Customer_ID', 'Call_Time'])

Edit: I am not sure I understood the question correctly, so I modified my solution:

import io
import pandas as pd

temp1 = u"""Call_Time,Customer_ID,Survey
8/26/2015,aaa123,1
8/27/2015,bbb222,1"""

temp = u"""Call_Time,Customer_ID
8/14/2015 0:00,aaa123
8/7/2015 0:00,aaa123
7/15/2015 0:00,aaa123
8/22/2015 0:00,aaa123
8/3/2015 0:00,bbb222
8/8/2015 0:00,bbb222
8/10/2015 0:00,bbb222"""

fcr = pd.read_csv(io.StringIO(temp), parse_dates=True)
df = pd.read_csv(io.StringIO(temp1), parse_dates=True)
fcr['Call_Time'] = pd.to_datetime(fcr['Call_Time'], format='%m/%d/%Y %H:%M')
df['Call_Time'] = pd.to_datetime(df['Call_Time'], format='%m/%d/%Y')
fcr['Total_Hits'] = 1

# reset_index() turns the (Call_Time, Customer_ID) MultiIndex back into columns
g14 = fcr.groupby([pd.Grouper(freq='14D', key='Call_Time'), 'Customer_ID']).sum().reset_index()
g7 = fcr.groupby([pd.Grouper(freq='7D', key='Call_Time'), 'Customer_ID']).sum().reset_index()
g3 = fcr.groupby([pd.Grouper(freq='3D', key='Call_Time'), 'Customer_ID']).sum().reset_index()
g14 = g14.rename(columns={'Total_Hits': 'Total_Hits_14'})
g7 = g7.rename(columns={'Total_Hits': 'Total_Hits_7'})
g3 = g3.rename(columns={'Total_Hits': 'Total_Hits_3'})

temp = pd.merge(g14, g7, how='outer', on=['Call_Time', 'Customer_ID'])
previous_hits = pd.merge(temp, g3, how='outer', on=['Call_Time', 'Customer_ID'])
df2 = pd.merge(df, previous_hits, how='left', on=['Customer_ID'])
df2 = df2.rename(columns={'Call_Time_x': 'Call_Time', 'Call_Time_y': 'Call_Time_fcr'})
print(df2)
#     Call_Time Customer_ID  Survey Call_Time_fcr  Total_Hits_14  Total_Hits_7  \
# 0  2015-08-26      aaa123       1    2015-07-15              1             1
# 1  2015-08-26      aaa123       1    2015-07-29              1           NaN
# 2  2015-08-26      aaa123       1    2015-08-12              2             1
# 3  2015-08-26      aaa123       1    2015-08-05            NaN             1
# 4  2015-08-26      aaa123       1    2015-08-19            NaN             1
# 5  2015-08-26      aaa123       1    2015-08-14            NaN           NaN
# 6  2015-08-26      aaa123       1    2015-08-20            NaN           NaN
# 7  2015-08-27      bbb222       1    2015-07-29              3             1
# 8  2015-08-27      bbb222       1    2015-08-05            NaN             2
# 9  2015-08-27      bbb222       1    2015-08-02            NaN           NaN
# 10 2015-08-27      bbb222       1    2015-08-08            NaN           NaN
#
#     Total_Hits_3
# 0              1
# 1            NaN
# 2            NaN
# 3              1
# 4            NaN
# 5              1
# 6              1
# 7            NaN
# 8            NaN
# 9              1
# 10             2

I appended reset_index() to g14, g7, and g3 to reset the MultiIndex.
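I think that for the dataframe previous_hits it is better to use the union of keys from both frames (how='outer'). I renamed several columns for elegance; this is not necessary.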
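
Note that pd.Grouper with a fixed frequency builds bins anchored at the earliest call, not at each survey date, so the counts above are per calendar bin. If what is wanted is the number of calls in the N days before each survey, one possible alternative (not part of the original answer) is to merge each survey with that customer's calls and count the ones that fall in a look-back window. The sketch below is a minimal illustration under stated assumptions: the column names Hits_14, Hits_7, Hits_3 and the variable names are invented for the example, and a call is assumed to count if it happened between 1 and N days before the survey, inclusive.

```python
import io
import pandas as pd

calls_csv = """Call_Time,Customer_ID
8/14/2015,aaa123
8/7/2015,aaa123
7/15/2015,aaa123
8/22/2015,aaa123
8/3/2015,bbb222
8/8/2015,bbb222
8/10/2015,bbb222"""

surveys_csv = """Call_Time,Customer_ID,Survey
8/26/2015,aaa123,1
8/27/2015,bbb222,1"""

fcr = pd.read_csv(io.StringIO(calls_csv))
df = pd.read_csv(io.StringIO(surveys_csv))
fcr['Call_Time'] = pd.to_datetime(fcr['Call_Time'], format='%m/%d/%Y')
df['Call_Time'] = pd.to_datetime(df['Call_Time'], format='%m/%d/%Y')

# Pair every survey with every call made by the same customer.
# The survey date keeps the name Call_Time; the call date becomes Call_Time_fcr.
pairs = df.merge(fcr, on='Customer_ID', how='left', suffixes=('', '_fcr'))

# Whole days between the call and the survey (NaN if the customer never called).
days_before = (pairs['Call_Time'] - pairs['Call_Time_fcr']).dt.days

# Assumption: a call counts toward a window if it is 1..N days before the survey.
for window in (14, 7, 3):
    in_window = (days_before >= 1) & (days_before <= window)
    pairs['Hits_%d' % window] = in_window.astype(int)

# Collapse back to one row per survey, summing the flags per window.
hit_cols = ['Hits_14', 'Hits_7', 'Hits_3']
result = pairs.groupby(['Customer_ID', 'Call_Time', 'Survey'], as_index=False)[hit_cols].sum()
print(result)
```

With the sample data, customer aaa123 has two calls in the 14 days before the 8/26 survey (8/14 and 8/22), one in the 7 days before, and none in the 3 days before; all of bbb222's calls are more than 14 days before the 8/27 survey, so its counts are zero. The resulting frame has one row per survey, which makes it straightforward to compare Survey scores against the call counts.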