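I don't know why I can't figure this out. I'm trying to count the rows that fall within specific time frames, based on the Customer_ID column. The time frames I'm interested in are 14 days, 7 days, and 3 days relative to Call_Time.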
df =
Call_Time Customer_ID Survey
8/26/2015 aaa123 1
8/27/2015 bbb222 1
dataframe fcr =
Call_Time Customer_ID
8/14/2015 aaa123
8/7/2015 aaa123
7/15/2015 aaa123
8/22/2015 aaa123
8/3/2015 bbb222
8/8/2015 bbb222
8/10/2015 bbb222
Here is the code I'm using now:
fcr['Total_Hits'] = 1
g14 = fcr.groupby([pd.Grouper(freq='14D',key='Call_Time'),'Customer_ID']).sum()
g7 = fcr.groupby([pd.Grouper(freq='7D',key='Call_Time'),'Customer_ID']).sum()
g3 = fcr.groupby([pd.Grouper(freq='3D',key='Call_Time'),'Customer_ID']).sum()
Then I want to join these values to another dataframe that comes from a separate file.
temp = pd.merge(g14, g7, how ='left', on = ['Call_Time', 'Customer_ID'])
previous_hits = pd.merge(temp, g3, how ='left', on = ['Call_Time', 'Customer_ID'])
df2 = pd.merge(df, previous_hits, how ='left', on = ['Call_Time', 'Customer_ID'])
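So my df2 would combine all the call records (fcr) with the original df (the survey results). What I want to know is: for each customer who filled out a survey, how many times did they call in the 14, 7, or 3 days before filling it out? And do customers who called multiple times give lower scores?
Answer 0 (score: 1)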
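You can use a custom Grouper (see Doc example).
First I had to use the function pd.to_datetime, because my Call_Time column was not of datetime64 dtype. Then I added a Count column with the scalar value 1 and summed by the suggested frequencies and the Customer_ID column.
Grouper sorts the date column itself, so you can use:
import pandas as pd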
import io
temp=u"""Call_Time,Customer_ID
8/14/2015 0:00,aaa123
8/7/2015 0:00,aaa123
7/15/2015 0:00,aaa123
8/22/2015 0:00,aaa123
8/3/2015 0:00,bbb222
8/8/2015 0:00,bbb222
8/10/2015 0:00,bbb222"""
df = pd.read_csv(io.StringIO(temp), parse_dates=True)
#time format - http://strftime.org/
df['Call_Time'] = pd.to_datetime(df['Call_Time'], format='%m/%d/%Y %H:%M')
#add a Count column - each row is one call by the customer
df["Count"] = 1
print(df)
#
# Call_Time Customer_ID Count
#0 2015-08-14 aaa123 1
#1 2015-08-07 aaa123 1
#2 2015-07-15 aaa123 1
#3 2015-08-22 aaa123 1
#4 2015-08-03 bbb222 1
#5 2015-08-08 bbb222 1
#6 2015-08-10 bbb222 1
#
#grouping by frequency and Customer_ID
g14 = df.groupby([pd.Grouper(freq='14D',key='Call_Time'),'Customer_ID']).sum()
g7 = df.groupby([pd.Grouper(freq='7D',key='Call_Time'),'Customer_ID']).sum()
g3 = df.groupby([pd.Grouper(freq='3D',key='Call_Time'),'Customer_ID']).sum()
print(g14)
print(g7)
print(g3)
#
# Count
#Call_Time Customer_ID
#2015-07-15 aaa123 1
#2015-07-29 aaa123 1
# bbb222 3
#2015-08-12 aaa123 2
# Count
#Call_Time Customer_ID
#2015-07-15 aaa123 1
#2015-07-29 bbb222 1
#2015-08-05 aaa123 1
# bbb222 2
#2015-08-12 aaa123 1
#2015-08-19 aaa123 1
# Count
#Call_Time Customer_ID
#2015-07-15 aaa123 1
#2015-08-02 bbb222 1
#2015-08-05 aaa123 1
#2015-08-08 bbb222 2
#2015-08-14 aaa123 1
#2015-08-20 aaa123 1
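The bin labels in the output above suggest that pd.Grouper(freq='14D') cuts the timeline into fixed, consecutive 14-day bins that start at the earliest day in the data, rather than a window measured from each call. A minimal sketch of this, assuming a recent pandas where the default bin origin is the start of the first day, and using only the call dates from the sample data (the names calls and bins are illustrative):

```python
import pandas as pd

# Call dates copied from the sample fcr data in the question.
calls = pd.DataFrame({
    "Call_Time": pd.to_datetime(
        ["8/14/2015", "8/7/2015", "7/15/2015", "8/22/2015",
         "8/3/2015", "8/8/2015", "8/10/2015"],
        format="%m/%d/%Y",
    ),
})

# Count calls per fixed 14-day bin, ignoring Customer_ID.
# The index of the result holds the left edge of each bin.
bins = calls.groupby(pd.Grouper(key="Call_Time", freq="14D")).size()
print(bins)
```

The bin edges here depend on the earliest call in the whole frame, not on any survey date, which is worth keeping in mind when reading the per-customer counts.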
Sorting by Call_Time within each Customer_ID is only for checking the data:
#DataFrame.sort was removed from later pandas versions, so sort_values is used here
df = df.sort_values(['Customer_ID', 'Call_Time'])
EDIT: I am not sure I understood the question correctly, so I modified my solution:
import pandas as pd
import io
temp1=u"""Call_Time,Customer_ID,Survey
8/26/2015,aaa123,1
8/27/2015,bbb222,1"""
temp=u"""Call_Time,Customer_ID
8/14/2015 0:00,aaa123
8/7/2015 0:00,aaa123
7/15/2015 0:00,aaa123
8/22/2015 0:00,aaa123
8/3/2015 0:00,bbb222
8/8/2015 0:00,bbb222
8/10/2015 0:00,bbb222"""
fcr = pd.read_csv(io.StringIO(temp), parse_dates=True)
df = pd.read_csv(io.StringIO(temp1), parse_dates=True)
fcr['Call_Time'] = pd.to_datetime(fcr['Call_Time'], format='%m/%d/%Y %H:%M')
df['Call_Time'] = pd.to_datetime(df['Call_Time'], format='%m/%d/%Y')
fcr['Total_Hits'] = 1
g14 = fcr.groupby([pd.Grouper(freq='14D',key='Call_Time'),'Customer_ID']).sum().reset_index()
g7 = fcr.groupby([pd.Grouper(freq='7D',key='Call_Time'),'Customer_ID']).sum().reset_index()
g3 = fcr.groupby([pd.Grouper(freq='3D',key='Call_Time'),'Customer_ID']).sum().reset_index()
g14 = g14.rename(columns={'Total_Hits':'Total_Hits_14'})
g7 = g7.rename(columns={'Total_Hits':'Total_Hits_7'})
g3 = g3.rename(columns={'Total_Hits':'Total_Hits_3'})
temp = pd.merge(g14, g7, how ='outer', on = ['Call_Time', 'Customer_ID'])
previous_hits = pd.merge(temp, g3, how ='outer', on = ['Call_Time', 'Customer_ID'])
df2 = pd.merge(df, previous_hits, how ='left', on = ['Customer_ID'])
df2 = df2.rename(columns={'Call_Time_x':'Call_Time', 'Call_Time_y':'Call_Time_fcr'})
print(df2)
#
# Call_Time Customer_ID Survey Call_Time_fcr Total_Hits_14 Total_Hits_7 \
#0 2015-08-26 aaa123 1 2015-07-15 1 1
#1 2015-08-26 aaa123 1 2015-07-29 1 NaN
#2 2015-08-26 aaa123 1 2015-08-12 2 1
#3 2015-08-26 aaa123 1 2015-08-05 NaN 1
#4 2015-08-26 aaa123 1 2015-08-19 NaN 1
#5 2015-08-26 aaa123 1 2015-08-14 NaN NaN
#6 2015-08-26 aaa123 1 2015-08-20 NaN NaN
#7 2015-08-27 bbb222 1 2015-07-29 3 1
#8 2015-08-27 bbb222 1 2015-08-05 NaN 2
#9 2015-08-27 bbb222 1 2015-08-02 NaN NaN
#10 2015-08-27 bbb222 1 2015-08-08 NaN NaN
# Total_Hits_3
#0 1
#1 NaN
#2 NaN
#3 1
#4 NaN
#5 1
#6 1
#7 NaN
#8 NaN
#9 1
#10 2
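I appended reset_index() to g14, g7 and g3 to reset the MultiIndex.
I think that for the previous_hits dataframe it is better to use the union of keys from both frames (how='outer').
I renamed several columns for elegance. It is not necessary.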
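If the goal is the number of calls in the 14, 7, or 3 days before each survey date, a different approach may be closer to the question than fixed Grouper bins. The following is only a sketch under stated assumptions, not the method used above: a call counts when it falls 1 to N days before the survey date, and the names surveys, calls, pairs, days_before, result and Hits_N are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
surveys = pd.DataFrame({
    "Call_Time": pd.to_datetime(["8/26/2015", "8/27/2015"], format="%m/%d/%Y"),
    "Customer_ID": ["aaa123", "bbb222"],
    "Survey": [1, 1],
})
calls = pd.DataFrame({
    "Call_Time": pd.to_datetime(
        ["8/14/2015", "8/7/2015", "7/15/2015", "8/22/2015",
         "8/3/2015", "8/8/2015", "8/10/2015"],
        format="%m/%d/%Y",
    ),
    "Customer_ID": ["aaa123"] * 4 + ["bbb222"] * 3,
})

# Pair every survey with every call made by the same customer.
# The survey date keeps the name Call_Time; the call date becomes Call_Time_fcr.
pairs = surveys.merge(calls, on="Customer_ID", how="left", suffixes=("", "_fcr"))

# Whole days between each call and the survey (NaN when a customer has no calls).
days_before = (pairs["Call_Time"] - pairs["Call_Time_fcr"]).dt.days

# Assumption: a call is "in the window" if it happened 1..N days before the survey.
for window in (14, 7, 3):
    pairs[f"Hits_{window}"] = days_before.between(1, window)

# Summing the boolean flags gives the number of calls per survey row.
result = pairs.groupby(
    ["Call_Time", "Customer_ID", "Survey"], as_index=False
)[["Hits_14", "Hits_7", "Hits_3"]].sum()
print(result)
```

For the sample data this gives 2, 1 and 0 calls for aaa123 in the 14, 7 and 3 day windows, and no calls in any window for bbb222, because all of bbb222's calls are more than 14 days before the survey.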