Parallelize Pandas DataFrame functionality

Asked: 2018-04-24 06:31:27

Tags: python pandas dataframe parallel-processing multiprocessing

The task is to create a single function that:

  • splits the rows according to the number of partitions the user requests, and
  • allows a user-specified function to be passed into the parallelized apply function

E.g. given a text input, I've tried a parallelized pd.DataFrame.apply:

from multiprocessing import Pool

import numpy as np
import pandas as pd

from nltk import word_tokenize

def apply_me(df_this):
    # Tokenize the text column of this chunk of the DataFrame.
    return df_this[0].astype(str).apply(word_tokenize)

def parallelize_apply(df, apply_me, num_partitions):
    # Split the DataFrame into chunks and run apply_me on each in parallel.
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(apply_me, df_split))
    pool.close()
    pool.join()  # wait for the worker processes to finish
    return df

text = """Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel"."""

df = pd.DataFrame(text.split('\n'))

parallelize_apply(df, apply_me, 2)

[out]:

0              [Let, 's, try, something, .]
1           [I, have, to, go, to, sleep, .]
2    [Today, is, June, 18th, and, it, is, Muiriel, ...
3                 [Muiriel, is, 20, now, .]
4    [The, password, is, ``, Muiriel, '', .]
Name: 0, dtype: object

My question is whether there is a way to pass the word_tokenize() function into parallelize_apply() instead of hard-coding it into apply_me().
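One constraint worth spelling out (an editorial note, not from the original post): Pool.map pickles the callable it sends to the worker processes, so an inline lambda cannot stand in for the hard-coded tokenizer; whatever gets mapped has to be an importable, module-level callable. A minimal illustration of the failure:

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(2) as pool:
        try:
            # A lambda has no importable name, so pickling it fails and
            # Pool.map surfaces the error in the parent process.
            pool.map(lambda s: s.upper(), ['a', 'b'])
        except Exception as err:
            print(type(err).__name__)  # PicklingError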

Simplifying the function, I've tried:

from multiprocessing import Pool
from functools import partial

import numpy as np
import pandas as pd

from nltk import word_tokenize

def apply_me(df_this, func):
    # Apply the user-specified function to the text column of this chunk.
    return df_this[0].astype(str).apply(func)

def parallelize_apply(df, func, num_partitions):
    df_split = np.array_split(df, num_partitions)
    with Pool(num_partitions) as pool:
        # Bind func so the mapped callable only takes a chunk argument.
        _apply_me = partial(apply_me, func=func)
        df = pd.concat(pool.map(_apply_me, df_split))
    return df.tolist()

text = """Let's try something.
I have to go to sleep.
Today is June 18th and it is Muiriel's birthday!
Muiriel is 20 now.
The password is "Muiriel"."""

df = pd.DataFrame(text.split('\n'))

parallelize_apply(df, word_tokenize, 6)

It achieves the same output, but it still needs the intermediate apply_me() function for parallelize_apply() to work as desired. Is there a way to remove it?
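For what it's worth, here is one way the intermediate function could be dropped entirely; this is a sketch of my own rather than anything from the post, and the chunksize-based splitting is an assumption standing in for np.array_split. Since word_tokenize is a module-level function it pickles cleanly, so Pool.map can apply it to the rows directly:

from multiprocessing import Pool

import pandas as pd
from nltk import word_tokenize

def parallelize_apply(df, func, num_partitions):
    # Map func over each row's text directly; chunksize batches the rows
    # into roughly num_partitions groups, so no wrapper function is needed.
    chunksize = max(1, len(df) // num_partitions)
    with Pool(num_partitions) as pool:
        return pool.map(func, df[0].astype(str), chunksize=chunksize)

if __name__ == '__main__':
    df = pd.DataFrame("Let's try something.\nI have to go to sleep.".split('\n'))
    print(parallelize_apply(df, word_tokenize, 2))

Like the simplified version above, this returns a plain list of token lists rather than a Series.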

0 Answers:

There are no answers yet.