Question

In pandas, if I want to create a column of conditional dummies (1 if a variable equals a given string, 0 otherwise), my go-to is:

data["ebt_dummy"] = np.where((data["paymenttypeid"]=='ebt'), 1, 0)

Naively trying this on a Dask dataframe raises an error. Following the directions in the map_partitions documentation also raises an error:

data = data.map_partitions(lambda df: df.assign(ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)),  meta={'paymenttypeid': 'str', 'ebt_dummy': 'i8'})

What is a good way, or the most Dask-thonic way, to do this?

Answer 1

Here is some sample data to play with, built as a pandas dataframe:

In [1]:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(np.transpose([np.random.choice(['ebt','other'], (10)),
              np.random.rand(10)]), columns=['paymenttypeid','other'])

df

Out[1]:

  paymenttypeid                 other
0         other    0.3130770966143612
1         other    0.5167434068096931
2           ebt    0.7606898392115471
3           ebt    0.9424572692382547
4           ebt     0.624282017575857
5           ebt    0.8584841824784487
6         other    0.5017083765654611
7         other  0.025994123211164233
8           ebt   0.07045354449612984
9           ebt   0.11976351556850084

Let's convert this to a Dask dataframe:

In [2]:
data = dd.from_pandas(df, npartitions=2)

and use apply (on a Series) to assign:

In [3]:
data['ebt_dummy'] = data.paymenttypeid.apply(lambda x: 1 if x =='ebt' else 0, meta=('paymenttypeid', 'str'))

data.compute()

Out [3]:

  paymenttypeid                 other  ebt_dummy
0         other    0.3130770966143612          0
1         other    0.5167434068096931          0
2           ebt    0.7606898392115471          1
3           ebt    0.9424572692382547          1
4           ebt     0.624282017575857          1
5           ebt    0.8584841824784487          1
6         other    0.5017083765654611          0
7         other  0.025994123211164233          0
8           ebt   0.07045354449612984          1
9           ebt   0.11976351556850084          1

Update

It seems that the meta you passed is the problem, since this works:

data = data.map_partitions(lambda df: df.assign(
    ebt_dummy = np.where((df["paymenttypeid"]=='ebt'), 1, 0)))

data.compute()

In my example, if I wanted to specify the meta, I would have to pass the dtypes of the current data, not the ones I expect after the assignment.

Answer 2

This also worked for me:

data['ebt_dummy'] = dd.from_array(np.where((df["paymenttypeid"]=='ebt'), 1, 0))

Dask + Pandas: Returning a sequence of conditional dummies
