Dask: Transforming a many-to-many relationship DataFrame

Date: 2019-03-20 14:56:08

Tags: python dataframe dask

I have a dask DataFrame like the one below.

> print(df_user_preferences)
       user_id  food_id
int64  int64    int64
...

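This DataFrame represents a many-to-many relationship between user and food. There are two more DataFrames, df_users and df_foods, which hold the master data for users and foods.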

Now I want to obtain a DataFrame like the following.

# index is user_id.
> print(df_spread_user_preferences)
       food_1   food_2   food_3   food_4  ...
int64  boolean  boolean  boolean  boolean ...
...

The columns prefixed with food_ end with a food_id, and their values indicate whether a relationship exists between that user and that food.

I tried the code below, but it is too slow. How can I improve it to make it more efficient?

df_spread_user_preferences = df_users.assign(**{
    f"food_{food_id}": lambda df, food_id: df.apply(
      lambda row, food_id: len(df_user_preferences[(
          df_user_preferences.food_id == food_id
      ) & (
          df_user_preferences.user_id == row.name
      )]) > 0,
      axis=1,
      meta='boolean',
      food_id=food_id
    ) for _, food_id in df_foods.index.to_series().iteritems()
}).drop(df_users.columns)

1 Answer:

Answer 0 (score: 0)

import pandas as pd

# Small example data standing in for the real DataFrames.
df_users = pd.DataFrame({'user_id': [1, 2]})
df_foods = pd.DataFrame({'food_id': [11, 22, 33, 44]})
df_user_preferences = pd.DataFrame({'user_id': [1, 1], 'food_id': [11, 22]})

# Create a dataframe with columns user_ids and all food_ids.
# All food_ids of all the users are assigned False
df_spread_user_preferences = pd.DataFrame({
        **{'user_id': df_users['user_id']}, 
        **{"food_{0}".format(i):False for i in df_foods['food_id']}})
# Find the food preference of the users and create a list 
foods = df_user_preferences.groupby(['user_id'])['food_id'].apply(list).apply(
    lambda x: ["food_{0}".format(i) for i in x]).reset_index()
# For each user, set the columns in their preference list to True
for _, r in foods.iterrows():
    df_spread_user_preferences.loc[
        df_spread_user_preferences['user_id'] == r['user_id'], r['food_id']] = True

print (df_spread_user_preferences)

   food_11  food_22  food_33  food_44  user_id
0     True     True    False    False        1
1    False    False    False    False        2

You can use df_spread_user_preferences.set_index('user_id') to set the index to user_id.
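The answer above fills in the table one user at a time with a Python loop. As a possible loop-free variant (not part of the original answer), the same result can be sketched with `pd.crosstab` followed by `reindex`. This assumes plain pandas DataFrames with the same toy data as above; a Dask DataFrame would first have to be brought into pandas or handled with Dask's own tools.

```python
import pandas as pd

# Same toy data as in the answer above.
df_users = pd.DataFrame({'user_id': [1, 2]})
df_foods = pd.DataFrame({'food_id': [11, 22, 33, 44]})
df_user_preferences = pd.DataFrame({'user_id': [1, 1], 'food_id': [11, 22]})

# Count how often each (user_id, food_id) pair occurs.
counts = pd.crosstab(df_user_preferences['user_id'],
                     df_user_preferences['food_id'])

# Add the users and foods that never appear in the preferences (count 0).
counts = counts.reindex(index=df_users['user_id'],
                        columns=df_foods['food_id'],
                        fill_value=0)
counts.index.name = 'user_id'

# A pair exists if its count is above zero; rename columns to food_<id>.
df_spread_user_preferences = (counts > 0).add_prefix('food_')
df_spread_user_preferences.columns.name = None

print(df_spread_user_preferences)
```

Here user_id is already the index, so no separate set_index call is needed. Whether this is fast enough depends on the data size: the wide table has one cell per user and food combination, so it must fit in memory.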