Given a Dask DataFrame, I am trying to find the most efficient way to apply a static value lookup.

Example problem: my data has a column "user_id" containing four possible values [4823, 1292, 9634, 7431]. I want to map these values to [0, 1, 2, 3] and store the result as a new column "user_id_mapped".

What is the most efficient way to achieve this in Dask? One possibility is to merge the main df with a lookup_df, but a merge is a fairly heavyweight operation. Even in plain Pandas, an index-based solution is usually much faster than a join/merge, for example:
import numpy as np
import pandas as pd

N = 100000
user_ids = [4823, 1292, 9634, 7431]
df = pd.DataFrame({
    "user_id": np.random.choice(user_ids, size=N),
    "dummy": np.random.uniform(size=N),
})
id_lookup_series = pd.Series(data=[0, 1, 2, 3], index=user_ids)
df["user_id_mapped"] = id_lookup_series[df["user_id"]].reset_index(drop=True)
I cannot transfer this approach to Dask, because the static id_lookup_series is a plain Pandas Series while the indexer df["user_id"] is a Dask Series. Is it possible to perform this kind of fast lookup in Dask?
Answer 0: (score: 1)
You can use merge if you convert your Pandas Series to a DataFrame:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: N = 100000
In [4]: user_ids = [4823, 1292, 9634, 7431]
In [5]: df = pd.DataFrame({
...: "user_id": np.random.choice(user_ids, size=N),
...: "dummy": np.random.uniform(size=N),
...: })
...:
...: id_lookup_series = pd.Series(data=[0, 1, 2, 3], index=user_ids)
...:
In [6]: result = df.merge(id_lookup_series.to_frame(), left_on='user_id', right_index=True)
In [7]: result.head()
Out[7]:
dummy user_id 0
0 0.416698 1292 1
1 0.053371 1292 1
6 0.407371 1292 1
14 0.772367 1292 1
18 0.958009 1292 1
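Note that the merged column above is literally named 0, because the lookup Series had no name. A small sketch of one way to get the "user_id_mapped" column name the question asks for (simply naming the Series before converting it to a frame):

```python
import numpy as np
import pandas as pd

N = 100000
user_ids = [4823, 1292, 9634, 7431]
df = pd.DataFrame({
    "user_id": np.random.choice(user_ids, size=N),
    "dummy": np.random.uniform(size=N),
})

# Giving the Series a name makes to_frame() produce a column with that name,
# so the merged result carries "user_id_mapped" instead of 0
id_lookup_series = pd.Series([0, 1, 2, 3], index=user_ids, name="user_id_mapped")
result = df.merge(id_lookup_series.to_frame(), left_on="user_id", right_index=True)
```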
Everything above works fine with Dask.dataframe as well. I wasn't sure whether you knew the user IDs ahead of time, so I added a step to compute them.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: N = 100000
In [4]: user_ids = [4823, 1292, 9634, 7431]
In [5]: df = pd.DataFrame({
...: "user_id": np.random.choice(user_ids, size=N),
...: "dummy": np.random.uniform(size=N),
...: })
In [6]: import dask.dataframe as dd
In [7]: ddf = dd.from_pandas(df, npartitions=10)
In [8]: user_ids = ddf.user_id.drop_duplicates().compute()
In [9]: id_lookup_series = pd.Series(list(range(len(user_ids))), index=user_ids.values)
In [10]: result = ddf.merge(id_lookup_series.to_frame(), left_on='user_id', right_index=True)
In [11]: result.head()
Out[11]:
dummy user_id 0
0 0.364693 4823 0
5 0.934778 4823 0
14 0.970289 4823 0
15 0.561710 4823 0
21 0.838962 4823 0
Answer 1: (score: 1)
I'm not sure why the code provided is so complicated. Based on the example problem description, you need to replace one set of values with another, so you can use the Series.replace(to_replace={}) method combined with Dask.DataFrame.map_partitions():
def replacer(df, to_replace):
    # Runs on each pandas partition: map user_id values via the lookup dict
    df['user_id_mapped'] = df['user_id'].replace(to_replace=to_replace)
    return df

new_dask_df = dask_df.map_partitions(
    replacer,
    to_replace={4823: 0, 1292: 1, 9634: 2, 7431: 3}
)
P.S. You may want to read up on the meta parameter of map_partitions, and consider organizing the code into a class to make it cleaner and avoid closures, but that is another topic.