Large number of Dask computations causing memory problems

Time: 2017-06-02 21:29:01

Tags: python pandas dask

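I am working on a task where I need to find pairs of geospatial points that are within 250 m of each other and within 20 minutes of each other. My dataset is about 1.2M rows and 10 columns, so checking every pair means roughly 1.2M ** 2 comparisons of distance and time difference against my criteria.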
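To make the pairwise criterion concrete, here is a minimal brute-force sketch on a tiny in-memory frame. The toy data and the `close_pairs` helper are hypothetical and only for illustration; the column names `LCC_x1`, `LCC_y1` and `Date` follow the code further down. It builds full n-by-n matrices, so it is only meant to show the logic, not to scale to 1.2M rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; coordinates are assumed to be in metres.
toy = pd.DataFrame({
    "LCC_x1": [0.0, 100.0, 1000.0, 50.0],
    "LCC_y1": [0.0, 0.0, 0.0, 0.0],
    "Date": pd.to_datetime([
        "2017-01-01 00:00", "2017-01-01 00:10",
        "2017-01-01 00:05", "2017-01-01 01:00",
    ]),
})

def close_pairs(frame, max_dist=250.0, max_minutes=20):
    """Return (i, j) row-position pairs, i < j, meeting both criteria."""
    x = frame["LCC_x1"].to_numpy()
    y = frame["LCC_y1"].to_numpy()
    t = frame["Date"].to_numpy()
    # Broadcasting creates n-by-n matrices of distances and time gaps.
    dist = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
    delta = np.abs(t[:, None] - t[None, :])
    mask = (dist < max_dist) & (delta < np.timedelta64(max_minutes, "m"))
    # Keep only the upper triangle so each pair is reported once.
    i, j = np.nonzero(np.triu(mask, k=1))
    return list(zip(i.tolist(), j.tolist()))

pairs = close_pairs(toy)
print(pairs)
```

Only rows 0 and 1 of the toy data are both within 250 m and within 20 minutes of each other; row 3 is close in space to both but an hour later.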

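I have been able to run the code below, in which I create 10,000 Dask objects to compute, without problems. However, when I try to test with 100,000 objects, Dask runs into memory limits and I see high CPU usage from swapping. For clarity, I am running this on a 32-core node with 125 GB of RAM.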
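For a rough sense of scale, the arithmetic can be sketched as follows. The 8 bytes per comparison is an assumption (one float64 per pairwise distance), not a measurement of what Dask actually allocates, and the variable names are illustrative.

```python
n_rows = 1_200_000

# Every row compared against every row (ordered pairs, self-pairs included).
n_pairs = n_rows ** 2

# Assumption: one float64 (8 bytes) per pairwise distance if fully materialised.
bytes_per_value = 8
full_matrix_tb = n_pairs * bytes_per_value / 1e12

# Each Dask object in the question compares one item against all rows.
rows_touched_10k = 10_000 * n_rows
rows_touched_100k = 100_000 * n_rows

print(n_pairs, full_matrix_tb, rows_touched_10k, rows_touched_100k)
```

Under that assumption, a fully materialised distance matrix would be on the order of 11.5 TB, far beyond 125 GB of RAM, so any intermediate results that are held rather than filtered and discarded add up quickly.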

Admittedly, I am still new to Dask, so I am wondering: is there a better way to approach this problem than processing it in 10,000-row chunks?

#!/usr/bin/env python
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.array import sqrt
import time
import multiprocessing as mp


df = pd.read_hdf(...) # Used to select single item for comparison
ddf = dd.read_hdf(...) # Used for Dask operations

def distCheck(item,df=ddf):
    '''
    Determine if any records in df are within 250m of item and within 20
    minutes of item. Return Dask object for calculation.
    '''

    # Work on the df argument so the function does not depend on the global ddf
    dist = sqrt((df.LCC_x1-item.LCC_x1)**2+(df.LCC_y1-item.LCC_y1)**2)
    delta = (df.Date - item.Date).abs()

    res1 = df.copy()
    res1['dist'] = dist
    res1['delta'] = delta
    # Keep rows that meet both the distance and the time criterion.
    # idcrit is not defined anywhere in this snippet, so it is left out here.
    res1 = res1[(dist < 250) & (delta < np.timedelta64(20,'m'))]
    res1['MatchMMSI'] = item.MMSI
    res1['MatchVoy'] = item.Voyage

    return res1


def getDaskCalls(start,stop):
    '''
    Get Dask objects to assess temporal and spatial proximity for df 
    indices from start to stop.
    '''
    # Kick off multiprocessing pool, submit, and close    
    pool = mp.Pool(processes=32)

    daskers = []
    for i in range(start,stop):
        result = pool.apply_async(distCheck,args=(df.iloc[i,:],ddf,))
        daskers.append(result)

    dasky = [i.get() for i in daskers]
    pool.close()    
    return dasky


def runDask(calls):
    result = pd.DataFrame([],columns=calls[0].columns)
    output = dd.compute(calls)
    result = pd.concat([result]+[i for i in output[0] if i.shape[0] != 0])

    return result


###
### Process
###

# Get initial timestamp
start = time.time()

# Create Dask Calls & determine duration
dcalls = getDaskCalls(0,10000)
callsCreated = time.time()

# Print time required to create calls
print("Dask Calls Created.")
print(callsCreated-start)

# Compute the calls with Dask
print("Computing...")
result = runDask(dcalls)

# Print the time for computation
computation = time.time()
print("   ...Done.")
print(computation-callsCreated)

0 answers:

No answers yet