How do I create a dummy column that indicates whether a column's cell value is duplicated?

Time: 2018-03-14 15:41:53

Tags: python pandas duplicates

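So, my data is travel data.

I want to create a column df['user_type'] that indicates whether a given df['user_id'] occurs more than once. If it does occur more than once, I want to label that user as a frequent user.

Here is my code, but it takes far too long to run: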

#Column that determines user type
def determine_user_type(val):
  df_freq = df[df['user_id'].duplicated()]

  user_type = ""
  if(val in df_freq['user_id'].values):
    user_type = "Frequent"
  else:
    user_type = "Single"

  return user_type

df['user_type'] = df['user_id'].apply(lambda x: determine_user_type(x))

3 Answers:

Answer 0: (score: 4)

Use numpy.where together with duplicated, and add the parameter keep=False so that all dupes are marked:

import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id':list('aaacbbt')})

df['user_type'] = np.where(df['user_id'].duplicated(keep=False), 'Frequent','Single')

Alternative:

d = {True:'Frequent',False:'Single'}
df['user_type'] = df['user_id'].duplicated(keep=False).map(d)

print (df)
  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
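
As a rough cross-check of the two variants above, here is a minimal self-contained sketch. It assumes the same sample data; the names is_repeat, via_where, via_map and the is_frequent column are made up for illustration. It also shows a plain 0/1 dummy column, in case a numeric flag is preferred over string labels.

```python
import numpy as np
import pandas as pd

# Sample data matching the example above.
df = pd.DataFrame({'user_id': list('aaacbbt')})

# Boolean mask: True for every row whose user_id occurs more than once.
is_repeat = df['user_id'].duplicated(keep=False)

# Variant 1: np.where picks a label per row based on the mask.
via_where = pd.Series(np.where(is_repeat, 'Frequent', 'Single'), index=df.index)

# Variant 2: map the booleans through a dict.
via_map = is_repeat.map({True: 'Frequent', False: 'Single'})

# Both variants should agree row for row.
assert via_where.tolist() == via_map.tolist()

# Hypothetical numeric dummy column (1 = repeated user_id, 0 = seen once).
df['is_frequent'] = is_repeat.astype(int)
print(df)
```

Because the mask is computed once for the whole column, neither variant repeats the per-row lookup that the apply-based approach in the question performs.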

Edit:

df = pd.DataFrame({'user_id':list('aaacbbt')})
print (df)
  user_id
0       a
1       a
2       a
3       c
4       b
5       b
6       t

Here drop_duplicates removes all duplicates in the column user_id and keeps only the first row for each value (the default parameter is keep='first'):

df_single = df.drop_duplicates('user_id')
print (df_single)
  user_id
0       a
3       c
4       b
6       t

Series.duplicated returns True for every dupe except the first occurrence:

print (df['user_id'].duplicated())
0    False
1     True
2     True
3    False
4    False
5     True
6    False
Name: user_id, dtype: bool

df_freq = df[df['user_id'].duplicated()]
print (df_freq)
  user_id
1       a
2       a
5       b
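
To make the effect of the keep parameter easier to see, the following sketch puts the three settings side by side on the same sample values. The column names keep_first, keep_last and keep_False are just illustrative labels, not pandas names.

```python
import pandas as pd

s = pd.Series(list('aaacbbt'), name='user_id')

first = s.duplicated()               # default keep='first': first occurrence is not flagged
last = s.duplicated(keep='last')     # last occurrence is not flagged
every = s.duplicated(keep=False)     # every occurrence of a repeated value is flagged

comparison = pd.DataFrame({
    'user_id': s,
    'keep_first': first,
    'keep_last': last,
    'keep_False': every,
})
print(comparison)

# keep=False flags a row whenever either of the other two settings does.
assert (every == (first | last)).all()
```

Only keep=False marks all rows of a repeated user_id, which is why it is the setting used for the user_type column.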

Answer 1: (score: 2)

Using jezrael's data

df = pd.DataFrame({'user_id':list('aaacbbt')})

You can index into an array of labels, using the boolean mask cast to integers as positions

df.assign(
    user_type=
    np.array(['Single', 'Frequent'])[
        df['user_id'].duplicated(keep=False).astype(int)
    ]
)

  user_id user_type
0       a  Frequent
1       a  Frequent
2       a  Frequent
3       c    Single
4       b  Frequent
5       b  Frequent
6       t    Single
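
The indexing trick above can be unpacked into separate steps. This is only a sketch on the same sample data; the intermediate names labels and codes are made up here, and to_numpy() is used just to make the conversion explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Position 0 holds the label for False, position 1 the label for True.
labels = np.array(['Single', 'Frequent'])

# False -> 0, True -> 1, one code per row.
codes = df['user_id'].duplicated(keep=False).astype(int).to_numpy()

# Integer-array indexing picks labels[code] for every row at once.
user_type = labels[codes]

# assign returns a new DataFrame; df itself is left unchanged.
result = df.assign(user_type=user_type)
print(result)
```

Note that df.assign does not modify df in place, so the result has to be bound to a name (or assigned back to df) to keep the new column.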

Answer 2: (score: 2)

Using Jez's data, an approach involving value_counts

df.user_id.map(df.user_id.value_counts().gt(1).replace({True:'Frequent',False:'Single'}))
Out[52]: 
0    Frequent
1    Frequent
2    Frequent
3      Single
4    Frequent
5    Frequent
6      Single
Name: user_id, dtype: object
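
The one-liner above can be read as four steps. The sketch below spells them out on the same sample data; it swaps replace for map when turning booleans into labels, and MIN_TRIPS is a hypothetical threshold added only to show that the count-based approach is not tied to "more than once".

```python
import pandas as pd

df = pd.DataFrame({'user_id': list('aaacbbt')})

# Hypothetical threshold: a user is "Frequent" from this many rows onward.
MIN_TRIPS = 2

# Step 1: count the rows per user_id (a: 3, b: 2, c: 1, t: 1).
counts = df['user_id'].value_counts()

# Step 2: boolean Series indexed by user_id.
is_frequent = counts.ge(MIN_TRIPS)

# Step 3: turn the booleans into labels (map used here instead of replace).
labels = is_frequent.map({True: 'Frequent', False: 'Single'})

# Step 4: look up each row's user_id in the label Series.
df['user_type'] = df['user_id'].map(labels)
print(df)
```

With MIN_TRIPS = 2 this should match the duplicated(keep=False) answers; raising the threshold is something the duplicated-based approaches cannot express directly.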