I have a dask DataFrame like the one below.
> print(df_user_preferences)
user_id food_id
int64 int64 int64
...
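This DataFrame represents a many-to-many relationship between user and food.
There are also two other DataFrames, df_users and df_foods, which hold the master data for users and foods.
Now I want to obtain a DataFrame like the following.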
# index is user_id.
> print(df_spread_user_preferences)
food_1 food_2 food_3 food_4 ...
int64 boolean boolean boolean boolean ...
...
The columns prefixed with food_ end with a food_id, and their values indicate whether a relationship exists between that user and that food.
I tried the code below, but it is too slow. How can I improve this code to make it more efficient?
df_spread_user_preferences = df_users.assign(**{
    f"food_{food_id}": lambda df, food_id: df.apply(
        lambda row, food_id: len(df_user_preferences[(
            df_user_preferences.food_id == food_id
        ) & (
            df_user_preferences.user_id == row.name
        )]) > 0,
        axis=1,
        meta='boolean',
        food_id=food_id
    ) for _, food_id in df_foods.index.to_series().iteritems()
}).drop(df_users.columns)
Answer 0 (score: 0)
import pandas as pd

# Small sample data standing in for the real DataFrames
df_users = pd.DataFrame({'user_id': [1, 2]})
df_foods = pd.DataFrame({'food_id': [11, 22, 33, 44]})
df_user_preferences = pd.DataFrame({'user_id': [1, 1], 'food_id': [11, 22]})

# Create a DataFrame with a user_id column and one column per food_id.
# Every food column starts out as False for every user.
df_spread_user_preferences = pd.DataFrame({
    **{'user_id': df_users['user_id']},
    **{"food_{0}".format(i): False for i in df_foods['food_id']}})

# Collect each user's preferred foods as a list of column names
foods = df_user_preferences.groupby(['user_id'])['food_id'].apply(list).apply(
    lambda x: ["food_{0}".format(i) for i in x]).reset_index()

# For each user, set the columns in their preference list to True
for _, r in foods.iterrows():
    df_spread_user_preferences.loc[
        df_spread_user_preferences['user_id'] == r['user_id'], r['food_id']] = True

print(df_spread_user_preferences)
food_11 food_22 food_33 food_44 user_id
0 True True False False 1
1 False False False False 2
You can then use df_spread_user_preferences.set_index('user_id') to make user_id the index.
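As a possible alternative to looping over users, the same table can be built with a cross-tabulation. The following is only a minimal sketch in plain pandas, assuming the data fits in memory and using the same sample frames as above; the name spread is hypothetical, and the approach would need adapting for a dask DataFrame.

```python
import pandas as pd

# Same sample data as in the answer above
df_users = pd.DataFrame({"user_id": [1, 2]})
df_foods = pd.DataFrame({"food_id": [11, 22, 33, 44]})
df_user_preferences = pd.DataFrame({"user_id": [1, 1], "food_id": [11, 22]})

# Count each (user_id, food_id) pair, then turn the counts into booleans
spread = pd.crosstab(
    df_user_preferences["user_id"], df_user_preferences["food_id"]
) > 0

# Add users and foods that never appear in the preferences, filled with False
spread = spread.reindex(
    index=df_users["user_id"],
    columns=df_foods["food_id"],
    fill_value=False,
).astype(bool)

# Rename the columns to the food_<food_id> form; user_id stays as the index
spread.columns = [f"food_{food_id}" for food_id in spread.columns]
spread.index.name = "user_id"

print(spread)
```

This avoids the per-row filtering of the original attempt: the preference table is scanned once, and users or foods without any preference are filled in by the reindex step.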