In the dataset below, I need to find the unique sequences and assign a serial number to each of them.
DataSet:
user age maritalstatus product
A Young married 111
B young married 222
C young Single 111
D old single 222
E old married 111
F teen married 222
G teen married 555
H adult single 444
I adult single 333
Expected output:
young married 0
young single 1
old single 2
old married 3
teen married 4
adult single 5
Once the unique values above have been found, if I pass in a new user like the one below,
user age maritalstatus
X young married
it should return the matching products to me as a list.
X : [111, 222]
If there is no matching sequence, as below,
user age maritalstatus
Y adult married
it should give me an empty list.
Y : []
Answer 0 (score: 2)
First select only the output columns and apply drop_duplicates, then add a new column with range:
# assign to df1 so the original df (with user and product) stays available below
df1 = df[['age','maritalstatus']].drop_duplicates()
df1['no'] = range(len(df1.index))
print (df1)
age maritalstatus no
0 Young married 0
1 young married 1
2 young Single 2
3 old single 3
4 old married 4
5 teen married 5
7 adult single 6
If you want to convert all values to lowercase first:
# assign to df1 so the original df (with user and product) stays available below
df1 = df[['age','maritalstatus']].apply(lambda x: x.str.lower()).drop_duplicates()
df1['no'] = range(len(df1.index))
print (df1)
age maritalstatus no
0 young married 0
2 young single 1
3 old single 2
4 old married 3
5 teen married 4
7 adult single 5
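For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample data is rebuilt by hand from the question; the name keys is my own choice and not part of the original snippets.

```python
import pandas as pd

# Sample data copied from the question (note the mixed case in 'Young' and 'Single').
df = pd.DataFrame({
    'user': list('ABCDEFGHI'),
    'age': ['Young', 'young', 'young', 'old', 'old',
            'teen', 'teen', 'adult', 'adult'],
    'maritalstatus': ['married', 'married', 'Single', 'single', 'married',
                      'married', 'married', 'single', 'single'],
    'product': [111, 222, 111, 222, 111, 222, 555, 444, 333],
})

# Lowercase the key columns, keep the first occurrence of each pair,
# and number the pairs in order of first appearance.
keys = df[['age', 'maritalstatus']].apply(lambda x: x.str.lower()).drop_duplicates()
keys['no'] = range(len(keys))
print(keys)
```

Writing the result to a separate variable keeps the full df available for the product lookup that follows.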
Edit:
First convert to lowercase:
df[['age','maritalstatus']] = df[['age','maritalstatus']].apply(lambda x: x.str.lower())
print (df)
user age maritalstatus product
0 A young married 111
1 B young married 222
2 C young single 111
3 D old single 222
4 E old married 111
5 F teen married 222
6 G teen married 555
7 H adult single 444
8 I adult single 333
Then use merge and convert the unique product values to a list:
df2 = pd.DataFrame([{'user':'X', 'age':'young', 'maritalstatus':'married'}])
print (df2)
age maritalstatus user
0 young married X
a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[111, 222]
df2 = pd.DataFrame([{'user':'X', 'age':'adult', 'maritalstatus':'married'}])
print (df2)
age maritalstatus user
0 adult married X
a = pd.merge(df, df2, on=['age','maritalstatus'])['product'].unique().tolist()
print (a)
[]
But if you need the lists as a column, use transform:
df['prod'] = df.groupby(['age', 'maritalstatus'])['product'].transform('unique')
print (df)
user age maritalstatus product prod
0 A young married 111 [111, 222]
1 B young married 222 [111, 222]
2 C young single 111 [111]
3 D old single 222 [222]
4 E old married 111 [111]
5 F teen married 222 [222, 555]
6 G teen married 555 [222, 555]
7 H adult single 444 [444, 333]
8 I adult single 333 [444, 333]
EDIT1:
a = (pd.merge(df, df2, on=['age','maritalstatus'])
.groupby('user_y')['product']
.apply(lambda x: x.unique().tolist())
.to_dict())
print (a)
{'X': [111, 222]}
Detail:
print (pd.merge(df, df2, on=['age','maritalstatus']))
user_x age maritalstatus product user_y
0 A young married 111 X
1 B young married 222 X
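Note that the default inner merge drops users that have no matching rows, so a user such as Y from the question would be missing from the dictionary instead of being mapped to []. Below is a minimal sketch of one way to cover both cases, using a plain dictionary lookup rather than merge. The names new_users, products and result are illustrative assumptions, and the data is the question's sample with the key columns already lowercased.

```python
import pandas as pd

# Question's sample data, key columns already lowercased.
df = pd.DataFrame({
    'user': list('ABCDEFGHI'),
    'age': ['young', 'young', 'young', 'old', 'old',
            'teen', 'teen', 'adult', 'adult'],
    'maritalstatus': ['married', 'married', 'single', 'single', 'married',
                      'married', 'married', 'single', 'single'],
    'product': [111, 222, 111, 222, 111, 222, 555, 444, 333],
})

# Hypothetical new users to look up: X has a match, Y does not.
new_users = pd.DataFrame({
    'user': ['X', 'Y'],
    'age': ['young', 'adult'],
    'maritalstatus': ['married', 'married'],
})

# Map each (age, maritalstatus) pair to its list of unique products.
products = {
    key: grp.unique().tolist()
    for key, grp in df.groupby(['age', 'maritalstatus'])['product']
}

# Look up every new user, falling back to an empty list when the pair is unknown.
result = {
    user: products.get((age, status), [])
    for user, age, status in
    new_users[['user', 'age', 'maritalstatus']].itertuples(index=False)
}
print(result)
```

The new users' values would need the same lowercase conversion as df before the lookup if they can arrive in mixed case.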
Answer 1 (score: 0)
One way is pd.factorize. Note that I convert the columns to lowercase first so that the results make sense.
for col in ['user', 'age', 'maritalstatus']:
    df[col] = df[col].str.lower()
df['category'] = list(zip(df.age, df.maritalstatus))
df['category'] = pd.factorize(df['category'])[0]
# user age maritalstatus product category
# 0 a young married 111 0
# 1 b young married 222 0
# 2 c young single 111 1
# 3 d old single 222 2
# 4 e old married 111 3
# 5 f teen married 222 4
# 6 g teen married 555 4
# 7 h adult single 444 5
# 8 i adult single 333 5
Finally, drop the duplicates:
df_cats = df[['age', 'maritalstatus', 'category']].drop_duplicates()
# age maritalstatus category
# 0 young married 0
# 2 young single 1
# 3 old single 2
# 4 old married 3
# 5 teen married 4
# 7 adult single 5
To map the product lists onto each row, try the following:
s = df.groupby(['age', 'maritalstatus'])['product'].apply(list)
df['prod_catwise'] = list(map(s.get, zip(df.age, df.maritalstatus)))
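Another option is to use categorical data, which I strongly recommend for this kind of workflow. You can easily extract the codes from a categorical series via pd.Series.cat.codes.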
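The categorical route could look roughly like the sketch below. This is only one possible reading of the suggestion: joining the two columns with a space to form a single key, and the names pairs and cat_dtype, are my own assumptions. The categories are passed explicitly in order of first appearance, because a plain astype('category') would sort them alphabetically and give different numbers.

```python
import pandas as pd

# Question's sample data, key columns already lowercased.
df = pd.DataFrame({
    'age': ['young', 'young', 'young', 'old', 'old',
            'teen', 'teen', 'adult', 'adult'],
    'maritalstatus': ['married', 'married', 'single', 'single', 'married',
                      'married', 'married', 'single', 'single'],
    'product': [111, 222, 111, 222, 111, 222, 555, 444, 333],
})

# Combine both key columns into one string key per row (assumed separator: a space).
pairs = df['age'] + ' ' + df['maritalstatus']

# Fix the category order to first appearance so the codes run 0, 1, 2, ... in row order.
cat_dtype = pd.CategoricalDtype(categories=pairs.unique())
df['category'] = pairs.astype(cat_dtype).cat.codes
print(df)
```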