Most efficient way to join/lookup/map column values in Dask?

Asked: 2017-06-12 16:25:30

Tags: python dask

Given a Dask DataFrame, I am trying to find the most efficient way to apply a static value lookup.

Example problem: my data has a column "user_id" containing four possible values [4823, 1292, 9634, 7431]. I want to map these values to [0, 1, 2, 3] and store the result as a new column "user_id_mapped".

What is the most efficient way to achieve this in Dask? One possibility is to join the main df to a lookup_df, but a join is a fairly heavyweight operation. Even in plain Pandas, an index-based solution is usually much faster than a join/merge, for example:

import numpy as np
import pandas as pd

N = 100000
user_ids = [4823, 1292, 9634, 7431]

df = pd.DataFrame({
    "user_id": np.random.choice(user_ids, size=N),
    "dummy": np.random.uniform(size=N),
})

# Lookup series: index = original IDs, values = mapped IDs
id_lookup_series = pd.Series(data=[0, 1, 2, 3], index=user_ids)

df["user_id_mapped"] = id_lookup_series[df["user_id"]].reset_index(drop=True)

I cannot transfer this approach to Dask, because the static id_lookup_series is a plain Pandas Series, while the indexer df["user_id"] is a Dask Series. Is it possible to perform this kind of fast lookup in Dask?

2 Answers:

Answer 0 (score: 1)

Pandas Solution

You can use merge if you convert your Pandas Series to a DataFrame:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: N = 100000

In [4]: user_ids = [4823, 1292, 9634, 7431]

In [5]: df = pd.DataFrame({
   ...:     "user_id": np.random.choice(user_ids, size=N),
   ...:     "dummy": np.random.uniform(size=N),
   ...: })
   ...: 
   ...: id_lookup_series = pd.Series(data=[0, 1, 2, 3], index=user_ids)
   ...: 

In [6]: result = df.merge(id_lookup_series.to_frame(), left_on='user_id', right_index=True)

In [7]: result.head()
Out[7]: 
       dummy  user_id  0
0   0.416698     1292  1
1   0.053371     1292  1
6   0.407371     1292  1
14  0.772367     1292  1
18  0.958009     1292  1
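Note that merging an unnamed Series produces a column literally named 0, as in the output above. A small variation (a sketch reusing the same synthetic data) is to name the lookup Series up front, so the merged column arrives directly as "user_id_mapped":

```python
import numpy as np
import pandas as pd

N = 100000
user_ids = [4823, 1292, 9634, 7431]

df = pd.DataFrame({
    "user_id": np.random.choice(user_ids, size=N),
    "dummy": np.random.uniform(size=N),
})

# Giving the Series a name controls the column name after to_frame()/merge
id_lookup_series = pd.Series([0, 1, 2, 3], index=user_ids, name="user_id_mapped")

result = df.merge(id_lookup_series.to_frame(), left_on="user_id", right_index=True)
```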

Dask Dataframe Solution

Everything above works fine with Dask.dataframe as well. I wasn't sure if you knew the user IDs ahead of time or not, so I added in a step to compute them.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: N = 100000

In [4]: user_ids = [4823, 1292, 9634, 7431]

In [5]: df = pd.DataFrame({
   ...:     "user_id": np.random.choice(user_ids, size=N),
   ...:     "dummy": np.random.uniform(size=N),
   ...: })

In [6]: import dask.dataframe as dd

In [7]: ddf = dd.from_pandas(df, npartitions=10)

In [8]: user_ids = ddf.user_id.drop_duplicates().compute()

In [9]: id_lookup_series = pd.Series(list(range(len(user_ids))), index=user_ids.values)

In [10]: result = ddf.merge(id_lookup_series.to_frame(), left_on='user_id', right_index=True)

In [11]: result.head()
Out[11]: 
       dummy  user_id  0
0   0.364693     4823  0
5   0.934778     4823  0
14  0.970289     4823  0
15  0.561710     4823  0
21  0.838962     4823  0

Answer 1 (score: 1)

I'm not sure why the provided code is so complicated. Based on the example problem description, you need to replace one set of values with another, so you can use the Series.replace(to_replace={}) method combined with Dask's DataFrame.map_partitions():

def replacer(df, to_replace):
    # Called once per partition; df is a plain Pandas DataFrame here
    df['user_id_mapped'] = df['user_id'].replace(to_replace=to_replace)
    return df

new_dask_df = dask_df.map_partitions(
    replacer,
    to_replace={4823: 0, 1292: 1, 9634: 2, 7431: 3},
)

P.S. You may want to read about the meta parameter of map_partitions, and consider organizing the code into a class to make it cleaner and avoid closures, but that's another topic.