Dask: Accessing a published dataset from within a submitted job

Date: 2020-05-14 15:31:29

Tags: pandas dask dask-distributed dask-dataframe

# Init
import time
import pandas as pd
import numpy as np
from dask.distributed import Client
client = Client()
# Publish data
dataset_name = 'my_dataset'
df_my_dataset = pd.DataFrame(np.ones((2,3)), dtype=np.float32)
client.publish_dataset(df_my_dataset, name=dataset_name)

The dataset is there:

In [13]: client.list_datasets()
Out[13]: ('my_dataset',)

Next, I define the function to submit to Dask. Inside it, I would like to access the published dataset by name:

# submit function
def get_gate1_rows(df_from_submit):
    return df_from_submit.mean()
    # return df.mean() + my_dataset.mean() #### <<<<<<< How to do this?

Finally, I submit it:

# Submit code
df_zeros = np.zeros((2,3), dtype=np.float32)
future = client.submit(get_gate1_rows, df_zeros)
time.sleep(2)
result = future.result()

This yields 0.0, but I expected 0.5:

In [41]: result
Out[41]: 0.0

So how do I access the published dataset from within the task?

1 Answer:

Answer 0 (score: 2)

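To access a published dataset from within a task, you need get_client, which gives the running task a client connected to the scheduler: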

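from dask.distributed import get_client

def get_gate1_rows(df_from_submit):
    # get_client() returns a client usable from inside the running task
    client = get_client()
    # Look up the published dataset by name
    my_dataset = client.get_dataset('my_dataset')
    return df_from_submit.mean() + my_dataset.mean()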

(The result is three 1s rather than 0.5, because df_zeros.mean() -> 0 and df_my_dataset.mean() -> 1, 1, 1.)
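
For reference, the pieces above can be combined into one script. This is only a minimal sketch: the run_example wrapper and the use of Client(processes=False) are assumptions made here so the example runs in a single local process, and the dataset is unpublished at the end purely as cleanup.

```python
import numpy as np
import pandas as pd
from dask.distributed import Client, get_client


def get_gate1_rows(arr_from_submit):
    # Inside the task, get_client() provides a client for the current cluster.
    client = get_client()
    # Fetch the published dataset by the name it was published under.
    my_dataset = client.get_dataset("my_dataset")
    # Scalar mean of the submitted array plus the per-column means of the dataset.
    return arr_from_submit.mean() + my_dataset.mean()


def run_example():
    # Hypothetical helper for this sketch; processes=False keeps the
    # scheduler and worker inside the current process (an assumption
    # chosen for a quick local run, not part of the original question).
    with Client(processes=False) as client:
        df_my_dataset = pd.DataFrame(np.ones((2, 3)), dtype=np.float32)
        client.publish_dataset(df_my_dataset, name="my_dataset")

        df_zeros = np.zeros((2, 3), dtype=np.float32)
        future = client.submit(get_gate1_rows, df_zeros)
        # result() blocks until the task has finished, so no sleep is needed.
        result = future.result()

        client.unpublish_dataset("my_dataset")
        return result


result = run_example()
print(result)
```

Note that future.result() already waits for the task to complete, so the time.sleep(2) call in the question is not required.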