Parallelizing a custom DataFrame function with Dask

Time: 2020-06-23 17:33:46

Tags: python pandas dataframe dask

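I am trying to use Dask's multiprocessing capabilities to speed up a loop over a pandas DataFrame. I am fully aware that for-looping over a DataFrame is generally not best practice, but in my case it is required. I have read through the documentation and other similar questions, but I can't seem to figure out my problem.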

df.head()
         Title                                                                                                                                       Content
0  Lizzibtz     @Ontario2020 @Travisdhanraj @fordnation Maybe.  They are not adding to the stress of education during Covid. Texas sample.  Plus…  
1  Jess ???️‍?  @BetoORourke So ashamed at how Abbott has not handled COVID in Texas. A majority of our large cities are hot spots with no end in sight.    
2  sidi diallo  New post (PVC Working Gloves) has been published on Covid-19 News Info - Texas test                    
3  Kautillya    @PandaJay What was the need to go to SC for yatra anyway? Isn't covid cases spiking exponentially? Ambubachi mela o… texas
4  SarahLou♡    RT @BenJolly9: 23rd June 2020 was the day Sir Keir Starmer let the Tories off the hook for their miss-handling of COVID-19. texas   

I have a custom Python function defined as:

def locMp(df):
    hitList = []
    for i in range(len(df)):
        print(i)
        string = df.iloc[i]['Content']
        # print(string)
        doc = nlp(string)
        ents = [e.text for e in doc.ents if e.label_ == "GPE"]
        x = np.array(ents)
        print(np.unique(x))
        hitList.append(np.unique(x))

    df['Locations'] = hitList
    return df

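This function adds a DataFrame column containing the locations extracted by a library called spaCy. I don't think that detail matters, but I wanted you to see the whole function.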

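Now, going by the documentation and a few other questions, the way to use Dask multiprocessing on a DataFrame is to create a Dask DataFrame, partition it, apply the function with map_partitions, and call .compute(). So I tried the following and a few other variants, with no luck: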

part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(lambda df: df.apply(locMp), meta=pd.DataFrame).compute()

# and...

part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(locMp, meta=pd.DataFrame).compute()

# and simplifying from Dask documentation

part = 7
ddf = dd.from_pandas(df, npartitions=part)
location = ddf.map_partitions(locMp)

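I tried a few other approaches with dask.delayed, but nothing seems to work. I either get a Dask Series or some other unexpected output, or the function takes as long as, or longer than, running it normally. How can I use Dask to speed up a custom DataFrame function and get back a clean pandas DataFrame?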

Thanks

1 Answer:

Answer 0 (score: 1)

You could try letting Dask handle the apply instead of looping over the rows yourself.
