Parallelizing an HDF read-translate-write workflow with Dask

Date: 2017-09-29 21:46:49

Tags: python pandas dask dask-delayed

TL;DR: We're having trouble using Dask to parallelize Pandas code that reads from and writes to the same HDF file.

I'm working on a project that generally involves three steps: reading, translating (or combining), and writing data. For context, we work with medical records: we receive claims in various formats, translate them into a standardized format, and then re-write them to disk. Ideally, I'd like to save the intermediate datasets in some form that can later be accessed via Python/Pandas.

Currently, I've chosen HDF as my data-storage format, but I'm running into runtime problems. For a large population, my code can currently take days to run. That led me to investigate Dask, but I'm not confident I've applied Dask as well as I could.

Below is a working example of my workflow, hopefully with enough sample data to illustrate the runtime problem.

Read (in this case, create) the data

import pandas as pd
import numpy as np
import dask
from dask import delayed
from dask import dataframe as dd
import random
from datetime import timedelta
from pandas.io.pytables import HDFStore

member_id = range(1, 10000)
window_start_date = pd.to_datetime('2015-01-01')
start_date_col = [window_start_date + timedelta(days=random.randint(0, 730)) for i in member_id]

# Eligibility records
eligibility = pd.DataFrame({'member_id': member_id,
                            'start_date': start_date_col})
eligibility['end_date'] = eligibility['start_date'] + timedelta(days=365)
eligibility['insurance_type'] = np.random.choice(['HMO', 'PPO'], len(member_id), p=[0.4, 0.6])
eligibility['gender'] = np.random.choice(['F', 'M'], len(member_id), p=[0.6, 0.4])
(eligibility.set_index('member_id')
 .to_hdf('test_data.h5',
         key='eligibility',
         format='table'))

# Inpatient records
inpatient_record_number = range(1, 20000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in inpatient_record_number]
inpatient = pd.DataFrame({'inpatient_record_number': inpatient_record_number,
                          'service_date': service_date})
inpatient['member_id'] = np.random.choice(list(range(1, 10000)), len(inpatient_record_number))
inpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(inpatient_record_number))
(inpatient.set_index('member_id')
 .to_hdf('test_data.h5',
         key='inpatient',
         format='table'))

# Outpatient records
outpatient_record_number = range(1, 30000)
service_date = [window_start_date + timedelta(days=random.randint(0, 730)) for i in outpatient_record_number]
outpatient = pd.DataFrame({'outpatient_record_number': outpatient_record_number,
                           'service_date': service_date})
outpatient['member_id'] = np.random.choice(range(1, 10000), len(outpatient_record_number))
outpatient['procedure'] = np.random.choice(['A', 'B', 'C', 'D'], len(outpatient_record_number))
(outpatient.set_index('member_id')
 .to_hdf('test_data.h5',
         key='outpatient',
         format='table'))

Translate / write the data

Sequential approach

def pull_member_data(member_i):
    inpatient_slice = pd.read_hdf('test_data.h5', 'inpatient', where='index == "{}"'.format(member_i))
    outpatient_slice = pd.read_hdf('test_data.h5', 'outpatient', where='index == "{}"'.format(member_i))
    return inpatient_slice, outpatient_slice


def create_visits(inpatient_slice, outpatient_slice):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER into medical 'visits'
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    visits_stacked = pd.concat([inpatient_slice, outpatient_slice]).reset_index().sort_values('service_date')
    visits_stacked.insert(0, 'visit_id', range(1, len(visits_stacked) + 1))
    return visits_stacked


def save_visits_to_hdf(visits_slice):
    with HDFStore('test_data.h5', mode='a') as store:
        store.append('visits', visits_slice)


# Read in the data by member_id, perform some operation
def translate_by_member(member_i):
    inpatient_slice, outpatient_slice = pull_member_data(member_i)
    visits_slice = create_visits(inpatient_slice, outpatient_slice)
    save_visits_to_hdf(visits_slice)


def run_translate_sequential():
    # Simple approach: Loop through each member sequentially
    for member_i in member_id:
        translate_by_member(member_i)

run_translate_sequential()

The code above takes roughly 9 minutes to run on my machine.

Dask approach

def create_visits_dask_version(visits_stacked):
    # In reality this is more complicated, using some logic to combine inpatient/outpatient/ER
    # But for simplicity, we'll just stack the inpatient/outpatient and assign a record identifier
    len_of_visits = visits_stacked.shape[0]
    visits_stacked_1 = (visits_stacked
                        .sort_values('service_date')
                        .assign(visit_id=range(1, len_of_visits + 1))
                        .set_index('visit_id')
                        )
    return visits_stacked_1


def run_translate_dask():
    # Approach 2: Dask, with individual writes to HDF
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    visits.to_hdf('test_data_dask.h5', 'visits')

run_translate_dask()

This Dask approach takes 13 seconds(!)

While this is a big improvement, we're still curious about a few things:

  1. Given this simple example, is using Dask dataframes, concatenating them, and then using groupby/apply the best approach?

  2. In reality, we have multiple processes like this that read from the same HDF and write to the same HDF. Our original codebase was structured so that the entire workflow could be run one member_id at a time. When we tried to parallelize those runs, it sometimes worked on small samples, but most of the time produced segmentation faults. Are there known problems with parallelizing workflows like this that read from and write to HDF? We're working on producing a reproducible example of that as well, but figured we'd post this here in case it prompts suggestions (or in case this code helps someone running into a similar problem). A rough sketch of the kind of per-member parallelization we attempted is included after this list.

  3. Any feedback is appreciated!
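
For reference (see question 2 above), here is a minimal sketch of the kind of per-member parallelization we attempted. It is illustrative only, not our exact production code: it simply wraps the translate_by_member function defined above in dask.delayed, so many tasks end up reading from and appending to the same HDF file concurrently, which is where the segmentation faults show up.

def run_translate_parallel():
    # Illustrative sketch only: parallelize the sequential per-member workflow
    # by wrapping each call in dask.delayed
    tasks = [delayed(translate_by_member)(member_i) for member_i in member_id]
    # Every task reads from and appends to the same test_data.h5 file;
    # this is the pattern that intermittently segfaults for us
    dask.compute(*tasks)

# run_translate_parallel()  # works on small samples, often segfaults on larger ones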

1 Answer:

Answer 0 (score: 1)

In general, groupby-apply will be fairly slow. Working with data this way is usually challenging, especially in limited memory.

Generally, I recommend using the Parquet format (dask.dataframe has to_parquet and read_parquet functions). You're far less likely to get segfaults than with HDF files.
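
A minimal sketch of that suggestion, adapting the question's run_translate_dask rather than showing code from the answer itself: the pipeline stays the same, but the result is written with to_parquet and can be read back with read_parquet. The output file name here is arbitrary, and either pyarrow or fastparquet must be installed for the Parquet functions to work.

def run_translate_dask_parquet():
    # Same pipeline as run_translate_dask, but the intermediate dataset is
    # stored as Parquet instead of HDF (requires pyarrow or fastparquet)
    inpatient_dask = dd.read_hdf('test_data.h5', 'inpatient')
    outpatient_dask = dd.read_hdf('test_data.h5', 'outpatient')
    stacked = dd.concat([inpatient_dask, outpatient_dask])
    visits = stacked.groupby('member_id').apply(create_visits_dask_version)
    # reset_index keeps member_id/visit_id as ordinary columns so the writer
    # does not have to serialize the grouped index
    visits.reset_index().to_parquet('test_data_visits.parquet')

# The intermediate dataset can then be reloaded later with:
# visits = dd.read_parquet('test_data_visits.parquet')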