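So, my data is trip data.
I want to create a column df['user_type'] that indicates whether each df['user_id'] occurs more than once. If it does occur more than once, I label that user as a frequent user.
Here is my code, but it takes too long to run: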
# Column that determines user type
def determine_user_type(val):
    df_freq = df[df['user_id'].duplicated()]
    user_type = ""
    if val in df_freq['user_id'].values:
        user_type = "Frequent"
    else:
        user_type = "Single"
    return user_type

df['user_type'] = df['user_id'].apply(lambda x: determine_user_type(x))
Answer 0 (score: 4)
Use numpy.where with duplicated, and add the parameter keep=False so that every occurrence of a duplicated value is flagged:
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})
df['user_type'] = np.where(df['user_id'].duplicated(keep=False), 'Frequent', 'Single')
Alternative:
d = {True: 'Frequent', False: 'Single'}
df['user_type'] = df['user_id'].duplicated(keep=False).map(d)
print(df)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
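The function in the question is slow because it rebuilds df_freq and scans it once for every row. As a rough check, the sketch below compares a row-by-row version (with the duplicate lookup computed once, which is my own simplification of the question's code) against the vectorized np.where version on the same sample data. The names dup_ids, slow and fast are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Row-by-row labelling, with the set of duplicated ids built only once
# (the question's function rebuilt this lookup on every call).
dup_ids = set(df.loc[df['user_id'].duplicated(), 'user_id'])
slow = df['user_id'].apply(lambda x: 'Frequent' if x in dup_ids else 'Single')

# Vectorized labelling: keep=False flags every occurrence of a repeated id.
fast = pd.Series(
    np.where(df['user_id'].duplicated(keep=False), 'Frequent', 'Single'),
    index=df.index,
)

# Both approaches should agree on every row.
assert slow.tolist() == fast.tolist()
```

On a large frame the vectorized version avoids calling a Python function per row, which is usually where the time goes.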
Edit:
df = pd.DataFrame({'user_id': list('aaacbbt')})
print(df)
  user_id
0       a
1       a
2       a
3       c
4       b
5       b
6       t
Here drop_duplicates removes all duplicates in the user_id column and keeps only the first row for each value (the default parameter is keep='first'):
df_single = df.drop_duplicates('user_id')
print(df_single)
  user_id
0       a
3       c
4       b
6       t
But Series.duplicated returns True for every duplicate except the first occurrence:
print(df['user_id'].duplicated())
0    False
1     True
2     True
3    False
4    False
5     True
6    False
Name: user_id, dtype: bool
df_freq = df[df['user_id'].duplicated()]
print(df_freq)
  user_id
1       a
2       a
5       b
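To make the effect of the keep parameter concrete, here is a small sketch that puts the three options side by side on the same sample ids. The column names in comparison are my own labels, not part of pandas.

```python
import pandas as pd

s = pd.Series(list('aaacbbt'), name='user_id')

# One boolean column per keep option, so the differences are visible per row.
comparison = pd.DataFrame({
    'user_id': s,
    'keep_first': s.duplicated(keep='first'),  # first occurrence is not flagged
    'keep_last': s.duplicated(keep='last'),    # last occurrence is not flagged
    'keep_False': s.duplicated(keep=False),    # every repeated value is flagged
})
print(comparison)
```

Only keep=False marks the first 'a' and the first 'b' as duplicated, which is why it is the option that fits the "Frequent" label.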
Answer 1 (score: 2)
Using jezrael's data
df = pd.DataFrame({'user_id':list('aaacbbt')})
You can index an array of labels with the boolean mask cast to integers
df.assign(
    user_type=
    np.array(['Single', 'Frequent'])[
        df['user_id'].duplicated(keep=False).astype(int)
    ]
)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
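The indexing step can be looked at in isolation. In this minimal sketch, labels and mask are illustrative names, and mask is written out by hand to match duplicated(keep=False) on the sample data.

```python
import numpy as np

labels = np.array(['Single', 'Frequent'])
mask = np.array([True, True, True, False, True, True, False])

# Casting to int turns the mask into positions: False -> 0, True -> 1.
# Each position then picks the label at that index.
result = labels[mask.astype(int)]
print(result)
```

The cast matters: indexing labels with the boolean array directly would be treated as a mask over a two-element array and fail on the length mismatch.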
Answer 2 (score: 2)
Using Jez's data, an approach based on value_counts
df.user_id.map(df.user_id.value_counts().gt(1).replace({True:'Frequent',False:'Single'}))
Out[52]:
0    Frequent
1    Frequent
2    Frequent
3      Single
4    Frequent
5    Frequent
6      Single
Name: user_id, dtype: object
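One thing the value_counts route gives you that duplicated does not is the actual number of trips per user. A sketch of that idea follows, assuming a hypothetical min_trips cutoff (not part of the question) set to 2 so that it reproduces the labels above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

min_trips = 2  # hypothetical threshold; 2 matches "appears more than once"

# Map each row's user_id to how many times that id appears in the column.
counts = df['user_id'].map(df['user_id'].value_counts())

# Label rows by comparing the per-user count with the threshold.
df['user_type'] = np.where(counts >= min_trips, 'Frequent', 'Single')
print(df)
```

Raising min_trips would tighten the definition of a frequent user without changing the rest of the code.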