Suppose I have a pandas DataFrame like this (minimal example):
import pandas as pd
myDf = pd.DataFrame({'user': ['A', 'B', 'C', 'D', 'E']*2, 'date': ['2017-05-25']*5+['2017-05-26']*5, 'nVisits': [10,2,3,0,0,6,0,4,8,1]})
The table looks like this:
date nVisits user
5/25/2017 10 A
5/25/2017 2 B
5/25/2017 3 C
5/25/2017 0 D
5/25/2017 0 E
5/26/2017 6 A
5/26/2017 0 B
5/26/2017 4 C
5/26/2017 8 D
5/26/2017 1 E
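(1) Each day I want to classify my users into 4 buckets: 0 visits, 1 visit, 2-4 visits, and 5+ visits. So I want to create a summary DataFrame that looks like this: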
date group nVisits nObs
5/25/2017 zero 0 2
5/25/2017 one 0 0
5/25/2017 twoToFour 2 2
5/25/2017 fivePlus 10 1
5/26/2017 zero 0 1
5/26/2017 one 1 1
5/26/2017 twoToFour 4 1
5/26/2017 fivePlus 16 2
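This DataFrame is basically the number of observations in each bucket and the total number of visits in each bucket, where the bucket a user belongs to is updated daily.
(2) I want to list all customer births and deaths, where a birth is a customer going from 0 visits to >= 1 visits, and a death is a customer going from >= 1 visits to 0 visits.
In this specific example, the new DataFrame would look like this: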
date event_type user nVisitsAtBirthDeath
5/26/2017 death B 2
5/26/2017 birth D 8
5/26/2017 birth E 1
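This DataFrame is basically a comparison of each day with the previous day: which users went from 0 visits to 1 or more visits, and which users went from 1 or more visits to 0 visits.
Can you help me get started on doing this efficiently? My original DataFrame is fairly large, so looping in Python is too slow.
Answer 0: (score: 4)
I would use the pd.cut() method: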
In [29]: df['group'] = pd.cut(df.nVisits,
[-1, 0, 1, 4, np.inf],
labels=['zero','one','twoToFour','fivePlus'])
In [30]: df
Out[30]:
date nVisits user group
0 2017-05-25 10 A fivePlus
1 2017-05-25 2 B twoToFour
2 2017-05-25 3 C twoToFour
3 2017-05-25 0 D zero
4 2017-05-25 0 E zero
5 2017-05-26 6 A fivePlus
6 2017-05-26 0 B zero
7 2017-05-26 4 C twoToFour
8 2017-05-26 8 D fivePlus
9 2017-05-26 1 E one
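Going from the bucket labels to the summary asked for in item (1) could look roughly like the sketch below. It assumes a pandas version that supports named aggregation, and it relies on the group column being categorical so that observed=False keeps buckets that are empty on a given day. The names summary and nObs are just illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': ['A', 'B', 'C', 'D', 'E'] * 2,
    'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
    'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1],
})

# Same bins as above: (-1, 0], (0, 1], (1, 4], (4, inf]
df['group'] = pd.cut(df['nVisits'], [-1, 0, 1, 4, np.inf],
                     labels=['zero', 'one', 'twoToFour', 'fivePlus'])

# observed=False keeps categories with no rows on a given date,
# so every date gets all four buckets (empty ones sum/count to 0)
summary = (df.groupby(['date', 'group'], observed=False)
             .agg(nVisits=('nVisits', 'sum'), nObs=('user', 'count'))
             .reset_index())
print(summary)
```

On the sample data this gives eight rows (two dates times four buckets). The twoToFour bucket on 2017-05-25 sums to 5 visits (2 + 3) and the fivePlus bucket on 2017-05-26 sums to 14 (6 + 8).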
Answer 1: (score: 2)
One way is to use np.where():
myDf['group'] = np.where(myDf.nVisits >= 5, 'fiveplus', np.where(myDf.nVisits == 0, 'zero', np.where(myDf.nVisits == 1, 'one', 'twotofour')))
date nVisits user group
0 2017-05-25 10 A fiveplus
1 2017-05-25 2 B twotofour
2 2017-05-25 3 C twotofour
3 2017-05-25 0 D zero
4 2017-05-25 0 E zero
5 2017-05-26 6 A fiveplus
6 2017-05-26 0 B zero
7 2017-05-26 4 C twotofour
8 2017-05-26 8 D fiveplus
9 2017-05-26 1 E one
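Nested np.where() calls get hard to read as the number of buckets grows. A flatter alternative, sketched here under the assumption that nVisits holds non-negative integers, is np.select(), which takes a list of conditions and a matching list of labels. The names conditions and choices are illustrative only.

```python
import numpy as np
import pandas as pd

myDf = pd.DataFrame({
    'user': ['A', 'B', 'C', 'D', 'E'] * 2,
    'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
    'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1],
})

n = myDf['nVisits']
# Conditions are checked in order; the first match wins
conditions = [n == 0, n == 1, n >= 5]
choices = ['zero', 'one', 'fivePlus']
# Anything not matched above (2, 3 or 4 visits) falls through to the default
myDf['group'] = np.select(conditions, choices, default='twoToFour')
print(myDf)
```

The default is passed as a string on purpose, so that all choices and the fallback share one dtype.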
Answer 2: (score: 2)
df1 = myDf.assign(group=pd.cut(myDf.nVisits, [0, 1, 2, 5, np.inf], right=False, labels=['zero', 'one', 'twotoFour', 'fivePlus']))
Output:
date nVisits user group
0 2017-05-25 10 A fivePlus
1 2017-05-25 2 B twotoFour
2 2017-05-25 3 C twotoFour
3 2017-05-25 0 D zero
4 2017-05-25 0 E zero
5 2017-05-26 6 A fivePlus
6 2017-05-26 0 B zero
7 2017-05-26 4 C twotoFour
8 2017-05-26 8 D fivePlus
9 2017-05-26 1 E one
df2 = df1.groupby(['date','group']).agg({'nVisits':'sum','user':'count'}).reset_index()
print(df2)
date group user nVisits
0 2017-05-25 fivePlus 1 10
1 2017-05-25 twotoFour 2 5
2 2017-05-25 zero 2 0
3 2017-05-26 fivePlus 2 14
4 2017-05-26 one 1 1
5 2017-05-26 twotoFour 1 4
6 2017-05-26 zero 1 0
# Keep only users with at least one zero-visit day, then take the day-over-day change per user
df2 = df1.assign(nVisitsAtBirthDeath=df1.groupby('user').filter(lambda x: x.nVisits.eq(0).any()).groupby('user')['nVisits'].diff()).dropna()
df3 = df2.assign(event=np.where(df2.nVisitsAtBirthDeath<0,'Death','Birth'))
print(df3)
Output:
date nVisits user group nVisitsAtBirthDeath event
6 2017-05-26 0 B zero -2.0 Death
8 2017-05-26 8 D fivePlus 8.0 Birth
9 2017-05-26 1 E one 1.0 Birth
Answer 3: (score: 1)
1. Solution to the first item
def label(visits):
    if visits == 0:
        return 'zero'
    if visits == 1:
        return 'one'
    if visits < 5:
        return 'twoToFour'
    return 'fivePlus'

myDf['group'] = myDf['nVisits'].apply(label)
2. Solution to the second item
myDf['last_day_visits'] = myDf.groupby('user').nVisits.shift(1).fillna(0)

def event_type(row):
    if row['nVisits'] > 0 and row['last_day_visits'] == 0:
        return 'birth'
    if row['nVisits'] == 0 and row['last_day_visits'] > 0:
        return 'death'

myDf['event_type'] = myDf.apply(event_type, axis=1)
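Since the question mentions that Python loops are too slow, note that apply with axis=1 still runs a Python function per row. The same comparison can be done with boolean masks, as in the rough sketch below. It assumes one row per user per day, consecutive days, and ISO-formatted date strings (so sorting by date is chronological). Unlike the fillna(0) version above, it leaves the previous-day value as NaN on a user's first day, so first-day users are not reported as births. The names prev, is_birth, is_death and events are illustrative.

```python
import pandas as pd

myDf = pd.DataFrame({
    'user': ['A', 'B', 'C', 'D', 'E'] * 2,
    'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
    'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1],
})

myDf = myDf.sort_values(['user', 'date'])
# Previous day's visits per user; NaN on each user's first recorded day
prev = myDf.groupby('user')['nVisits'].shift(1)

# Comparisons against NaN are False, so first-day rows match neither mask
is_birth = (prev == 0) & (myDf['nVisits'] > 0)
is_death = (prev > 0) & (myDf['nVisits'] == 0)

# Births report the visits on the day of birth,
# deaths report the visits on the day before
births = myDf.loc[is_birth, ['date', 'user']].assign(
    event_type='birth', nVisitsAtBirthDeath=myDf.loc[is_birth, 'nVisits'])
deaths = myDf.loc[is_death, ['date', 'user']].assign(
    event_type='death', nVisitsAtBirthDeath=prev[is_death])

events = (pd.concat([births, deaths])
            .astype({'nVisitsAtBirthDeath': int})
            .sort_values(['date', 'user'])
            .reset_index(drop=True)
            [['date', 'event_type', 'user', 'nVisitsAtBirthDeath']])
print(events)
```

On the sample data this yields three rows: a death for B (2 visits the day before) and births for D (8 visits) and E (1 visit), all on 2017-05-26.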