Python multiprocessing on a large dataframe on Linux

Date: 2019-10-13 23:55:32

Tags: python pandas python-multiprocessing

As the title indicates, I have a large dataframe (df) that needs to be processed row by row. Because df is big (6 GB), I want to use Python's multiprocessing package to speed things up. Below is a toy example; given my writing skills and the complexity of the task, I will briefly describe what I want to achieve and lay out the details in code.

The original data is df, on which I want to perform some row-wise analysis (the order does not matter) that requires not only the focal row itself but also other rows that satisfy certain conditions. Below are the toy data and my code:

import pandas as pd
import numpy as np
import itertools
from multiprocessing import Pool
import time
import math

# a test example
start_time = time.time()
df = pd.DataFrame({'value': np.random.randint(0, 10, size=30),
                   'district': (['upper'] * 5 + ['down'] * 5) * 3,
                   'region': ['A'] * 10 + ['B'] * 10 + ['C'] * 10})
df['row_id'] = df.index
print(df)

    value district region  row_id
0       8    upper      A       0
1       4    upper      A       1
2       0    upper      A       2
3       3    upper      A       3
4       0    upper      A       4
5       0     down      A       5
6       3     down      A       6
7       7     down      A       7
8       1     down      A       8
9       7     down      A       9
10      7    upper      B      10
11      3    upper      B      11
12      9    upper      B      12
13      8    upper      B      13
14      2    upper      B      14
15      4     down      B      15
16      5     down      B      16
17      3     down      B      17
18      5     down      B      18
19      3     down      B      19
20      3    upper      C      20
21      1    upper      C      21
22      3    upper      C      22
23      0    upper      C      23
24      3    upper      C      24
25      2     down      C      25
26      0     down      C      26
27      1     down      C      27
28      1     down      C      28
29      0     down      C      29

What I want to do is to add two more columns, count_a and count_b: within the same region and district subset, count_b counts the rows whose value falls in the open interval (value-2, value), and count_a counts the rows whose value falls in (value, value+2). For example, count_b for the row with row_id==0 should be 0, because no row within region=='A' and district=='upper' has a value of 7, which would fall in (8-2, 8). So the desired output should be:

    count_a count_b region  row_id
0         0       0      A       0
1         0       1      A       1
2         0       0      A       2
3         1       0      A       3
4         0       0      A       4
5         1       0      A       5
6         0       0      A       6
7         0       0      A       7
8         0       1      A       8
9         0       0      A       9
10        1       0      B      10
11        0       1      B      11
12        0       1      B      12
13        1       1      B      13
14        1       0      B      14
15        2       2      B      15
16        0       1      B      16
17        1       0      B      17
18        0       1      B      18
19        1       0      B      19
20        0       0      C      20
21        0       1      C      21
22        0       0      C      22
23        1       0      C      23
24        0       0      C      24
25        0       2      C      25
26        2       0      C      26
27        1       2      C      27
28        1       2      C      28
29        2       0      C      29

Question 1: can such a task be vectorized?

Question 2: how can we use multiprocessing to speed it up (solved)?
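
To make the definition above concrete, here is a small per-group reference check; it is not part of the original post, and the helper name count_neighbors is mine. It reproduces the desired count_a/count_b columns for the toy data with plain boolean masks over the open intervals:

import pandas as pd
import numpy as np

def count_neighbors(group):
    # for each row, count rows of the same (region, district) group whose value lies
    # strictly inside (value, value + 2) -> count_a, or (value - 2, value) -> count_b
    vals = group['value'].to_numpy()
    count_a = [int(((vals > v) & (vals < v + 2)).sum()) for v in vals]
    count_b = [int(((vals > v - 2) & (vals < v)).sum()) for v in vals]
    return pd.DataFrame({'count_a': count_a, 'count_b': count_b}, index=group.index)

# usage with the toy df above:
# check = pd.concat(count_neighbors(g) for _, g in df.groupby(['region', 'district'])).sort_index()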

I decided to go with multiprocessing because I am not sure how to accomplish this through vectorization. The solution (based on the answer provided) is:

def b_a(input_df, r_d):
    print('length of input dataframe: ' + str(len(input_df)))
    # print('region: ' + str(r_d[0]), 'district: ' + str(r_d[1]))
    sub_df = input_df.loc[(input_df['region'].isin([r_d[0]])) & (input_df['district'].isin([r_d[1]]))]
    print('length of sliced dataframe: ' + str(len(sub_df)))
    print(r_d[0], r_d[1])
    b_a = pd.DataFrame(columns=['count_a', 'count_b', 'row_id', 'region'])
    for id in sub_df['row_id']:
        print('processing row: ' + str(id))
        focal_value = sub_df.loc[sub_df['row_id'].isin([id])]['value']
        temp_b = sub_df.loc[
            (sub_df['value'] > (focal_value - 2).values[0]) & (sub_df['value'] < (focal_value.values[0]))]
        temp_a = sub_df.loc[
            (sub_df['value'] > (focal_value.values[0])) & (sub_df['value'] < (focal_value + 2).values[0])]
        if len(temp_a):
            temp_a['count_a'] = temp_a['row_id'].count()
        else:
            temp_a = temp_a.append(pd.Series(), ignore_index=True)
            temp_a = temp_a.reindex(
                columns=[*temp_a.columns.tolist(), 'count_a'], fill_value=0)
        print(temp_a)
        if len(temp_b):
            temp_b['count_b'] = temp_b['row_id'].count()
        else:
            temp_b = temp_b.append(pd.Series(), ignore_index=True)
            temp_b = temp_b.reindex(
                columns=[*temp_b.columns.tolist(), 'count_b'], fill_value=0)
        print(len(temp_a), len(temp_b))
        temp_b.drop_duplicates('count_b', inplace=True)
        temp_a.drop_duplicates('count_a', inplace=True)
        temp = pd.concat([temp_b[['count_b']].reset_index(drop=True),
                          temp_a[['count_a']].reset_index(drop=True)], axis=1)
        temp['row_id'] = id
        temp['region'] = str(r_d[0])
        b_a = pd.concat([b_a, temp])
    return b_a

r_d_list = list(itertools.product(df['region'].unique(), df['district'].unique()))

if __name__ == '__main__':
    P = Pool(3)
    out = P.starmap(b_a, zip([chunks[r_d_list.index(j)] for j in r_d_list for i in range(len(j))],
                             list(itertools.chain.from_iterable(r_d_list))))  # S3
    # out = P.starmap(b_a, zip([df for i in range(len(r_d_list))], r_d_list))  # S2
    # out = P.starmap(b_a, zip(df, r_d_list))  # S1
    # print(out)
    P.close()
    P.join()
    final = pd.concat(out, ignore_index=True)
    print(final)
    final.to_csv('final.csv', index=False)
    print("--- %s seconds ---" % (time.time() - start_time))

Since P.starmap (just like P.map) needs the function to be fed with all possible pairs of arguments, solution S1 does not work: zip(df, r_d_list) actually zips the column names of df with the elements of r_d_list, which then raises AttributeError: 'str' object has no attribute 'loc', because the input_df received by b_a is literally a string (a column name of df). This can be verified from the output of print('length of input dataframe: ' + str(len(input_df))), which prints the length of a column name of df rather than the length of the dataframe. The accepted answer corrects this by creating an array of references (dfa, I am not sure what exactly it is) with the same length as the argument list (r_d_list). This solution works great, but it may be slow when df is large because, as far as I understand, it has to search the entire dataframe for each pair of arguments (region and district). So I came up with a modified version that splits the data into chunks based on region and district and then searches within each chunk instead of the whole dataframe (S3). For me this solution improved the running time by 20%; see the code below:

Add this block before the if __name__ == '__main__': part (it rebuilds r_d_list as a list of per-chunk lists), and remember to comment out print(df):

region = df['region'].unique()
chunk_numbers = 3
chunk_region = math.ceil(len(region) / chunk_numbers)
chunks = list()
r_d_list = list()
row_count = 0
for i in range(chunk_numbers):
    print(i)
    if i < chunk_numbers - 1:
        regions = region[(i * chunk_region):((i + 1) * chunk_region)]
        temp = df.loc[df['region'].isin(regions.tolist())]
        chunks.append(temp)
        r_d_list.append(list(itertools.product(regions, temp['district'].unique())))
        del temp
    else:
        regions = region[(i * chunk_region):len(region)]
        temp = df.loc[df['region'].isin(regions.tolist())]
        chunks.append(temp)
        r_d_list.append(list(itertools.product(regions, temp['district'].unique())))
        del temp
    row_count = row_count + len(chunks[i])
    print(row_count)
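
To see what the S3 call above actually feeds to starmap, here is a small sketch (not from the original post) that mimics the pairing with plain strings standing in for the per-chunk dataframes:

import itertools

chunks = ['chunk_A', 'chunk_B', 'chunk_C']            # stand-ins for the per-region dataframes
r_d_list = [[('A', 'upper'), ('A', 'down')],
            [('B', 'upper'), ('B', 'down')],
            [('C', 'upper'), ('C', 'down')]]

frames = [chunks[r_d_list.index(j)] for j in r_d_list for i in range(len(j))]
pairs = list(itertools.chain.from_iterable(r_d_list))
print(list(zip(frames, pairs)))
# [('chunk_A', ('A', 'upper')), ('chunk_A', ('A', 'down')),
#  ('chunk_B', ('B', 'upper')), ('chunk_B', ('B', 'down')),
#  ('chunk_C', ('C', 'upper')), ('chunk_C', ('C', 'down'))]

Each (region, district) pair is matched with the chunk that contains its region, so every worker only searches its own chunk.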

Thanks to this wonderful community, I now have a working solution. I have updated my question to provide some material for people who may run into the same problem in the future, and to frame the question better in the hope of even better solutions.

3 Answers:

Answer 0 (score: 0)

Change

out = P.starmap(b_a,zip(df,r_d_list))

into

out = P.starmap(b_a, zip([df for i in range(len(r_d_list))], r_d_list))

The output looks as follows:

length of input dataframe: 300
region: B district: down
length of input dataframe: 300
region: C district: upper
length of sliced dataframe: 50
length of input dataframe: 300
region: C district: down
length of sliced dataframe: 50
length of sliced dataframe: 50
6
[  count_a count_b region row_id
0       6       7      A      0,   count_a count_b region row_id
0       2       4      A     50,   count_a count_b region row_id
0       1       4      B    100,   count_a count_b region row_id
0       7       4      B    150,   count_a count_b region row_id
0       4       9      C    200,   count_a count_b region row_id
0       4       4      C    250]

The dfa array maintains references to df

dfa = [df for i in range(len(r_d_list))]

for i in dfa:
    print(['id(i): ', id(i)])

The output of the above is as follows:

['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]

The difference between zip(df, r_d_list) and zip(dfa, r_d_list)

Check zip (https://docs.python.org/3.3/library/functions.html#zip) and the example there to understand how zip works and how it constructs its result.

list(zip(df, r_d_list)) returns the following:

[
('value', ('A', 'upper')),
('district', ('A', 'down')),
('region', ('B', 'upper')),
('row_id', ('B', 'down'))
]

list(zip(dfa, r_d_list)) returns the following:

[
(df, ('A', 'upper')),
(df, ('A', 'down')),
(df, ('B', 'upper')),
(df, ('B', 'down'))
]

You can find more examples of pool.starmap in Python multiprocessing pool.map for multiple arguments.
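
As a minimal illustration (not from the linked post; the worker function here is hypothetical), starmap unpacks each tuple produced by zip into the positional arguments of the worker:

import pandas as pd
from multiprocessing import Pool

def worker(frame, r_d):
    # frame is the full dataframe, r_d is a ('region', 'district') tuple
    return len(frame[(frame['region'] == r_d[0]) & (frame['district'] == r_d[1])])

if __name__ == '__main__':
    df = pd.DataFrame({'region': ['A', 'A', 'B', 'B'],
                       'district': ['upper', 'down', 'upper', 'down'],
                       'value': [1, 2, 3, 4]})
    pairs = [('A', 'upper'), ('A', 'down'), ('B', 'upper'), ('B', 'down')]
    with Pool(2) as p:
        # each element of the zip becomes one call: worker(df, pair)
        sizes = p.starmap(worker, zip([df] * len(pairs), pairs))
    print(sizes)  # [1, 1, 1, 1]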

Updated working code

import pandas as pd
import numpy as np
import itertools
from multiprocessing import Pool

df = pd.DataFrame({'value': np.random.randint(0, 10, size=300),
                   'district': (['upper'] * 50 + ['down'] * 50) * 3,
                   'region': ['A'] * 100 + ['B'] * 100 + ['C'] * 100})

df['row_id'] = df.index

# b_a = pd.DataFrame(columns=['count_a', 'count_b', 'row_id', 'region'])


# solution 2: multi processing
def b_a(input_df, r_d):
#    print('length of input dataframe: ' + str(len(input_df)))
#    print('region: ' + str(r_d[0]), 'district: ' + str(r_d[1]))

    sub_df = input_df.loc[(input_df['region'].isin([r_d[0]])) & (input_df['district'].isin([r_d[1]]))]  # subset data that in certain region and district

#    print('length of sliced dataframe: ' + str(len(sub_df)))

    b_a = pd.DataFrame(columns=['count_a', 'count_b', 'row_id', 'region'])  # an empty data frame to store result

    for id in sub_df['row_id']:
        focal_value = sub_df.loc[sub_df['row_id'].isin([id])]['value']
        temp_b = sub_df.loc[
            (sub_df['value'] > (focal_value - 2).values[0]) & (sub_df['value'] < (focal_value.values[0]))]
        temp_a = sub_df.loc[
            (sub_df['value'] > (focal_value.values[0])) & (sub_df['value'] < (focal_value + 2).values[0])]

        if len(temp_a):
            temp_a['count_a'] = temp_a['row_id'].count()
        else:
            temp_a = temp_a.reindex(
                columns=[*temp_a.columns.tolist(), 'count_a'], fill_value=0)

        if len(temp_b):
            temp_b['count_b'] = temp_b['row_id'].count()
        else:
            temp_b = temp_b.reindex(
                columns=[*temp_b.columns.tolist(), 'count_b'], fill_value=0)

        temp_b.drop_duplicates('count_b', inplace=True)
        temp_a.drop_duplicates('count_a', inplace=True)
        temp = pd.concat([temp_b[['count_b']].reset_index(drop=True),
                          temp_a[['count_a']].reset_index(drop=True)], axis=1)

        temp['row_id'] = id
        temp['region'] = str(r_d[0])

        b_a = pd.concat([b_a, temp])

    return b_a


r_d_list = list(itertools.product(df['region'].unique(), df['district'].unique()))

# dfa = [df for i in range(len(r_d_list))]

#for i in dfa:
#    print(['id(i): ', id(i)])

if __name__ == '__main__':
    P = Pool(3)
    out = P.starmap(b_a, zip([df for i in range(len(r_d_list))], r_d_list))
    # print(len(out))
    P.close()
    P.join()

    final = pd.concat(out, ignore_index=True)

    print(final)

Output of final:

    count_a count_b region row_id
0         4       6      A      0
1         5       4      A      1
2       NaN       5      A      2
3         5       8      A      3
4         5     NaN      A      4
..      ...     ...    ...    ...
295       2       7      C    295
296       6     NaN      C    296
297       6       6      C    297
298       5       5      C    298
299       6       6      C    299

[300 rows x 4 columns]

Answer 1 (score: 0)

I think there is still room for improvement here. I would suggest defining a function and applying it within a groupby:
import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
N = 30_000
# Now the example is reproducible
np.random.seed(0)
df = pd.DataFrame({'value': np.random.randint(0, 10, size=N),
                   'district': (['upper'] * 5 + ['down'] * 5) * 3000,
                   'region': ['A'] * 10_000 + ['B'] * 10_000 + ['C'] * 10_000,
                   'row_id': np.arange(N)})

The following function returns count_a and count_b for each row within a given group:

def fun(vec):
    out = []
    for i, v in enumerate(vec):
        a = vec[:i] + vec[i+1:]
        count_a = np.isin(a, [v-2, v]).sum()
        count_b = np.isin(a, [v, v+2]).sum()
        out.append([count_a, count_b])
    return out

Pandas

%%time
df[["count_a", "count_b"]] = df.groupby(["district", "region"])["value"]\
                               .apply(lambda x: fun(x.tolist()))\
                               .explode().apply(pd.Series)\
                               .reset_index(drop=True)
CPU times: user 22.6 s, sys: 174 ms, total: 22.8 s
Wall time: 22.8 s

Dask

Now you need to create df again, and then you can use dask. This is the first thing that came to my mind; there are certainly better/faster ways.

ddf = dd.from_pandas(df, npartitions=os.cpu_count())

df[["count_a", "count_b"]] = ddf.groupby(["district", "region"])["value"]\
                                .apply(lambda x: fun(x.tolist()),
                                       meta=('x', 'f8'))\
                                .compute(scheduler='processes')\
                                .explode().apply(pd.Series)\
                                .reset_index(drop=True)
CPU times: user 6.92 s, sys: 114 ms, total: 7.04 s
Wall time: 13.4 s

Multiprocessing

In this case you again need to recreate df. The trick here is to split df into a list lst of dataframes.

import multiprocessing as mp
def parallelize(fun, vec, cores):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

def par_fun(d):
    d = d.reset_index(drop=True)
    o = pd.DataFrame(fun(d["value"].tolist()),
                     columns=["count_a", "count_b"])
    return pd.concat([d,o], axis=1)
%%time
lst = [l[1] for l in list(df.groupby(["district", "region"]))]

out = parallelize(par_fun, lst, os.cpu_count())
out = pd.concat(out, ignore_index=True)
CPU times: user 152 ms, sys: 49.7 ms, total: 202 ms
Wall time: 5 s

Finally, you could try to improve the function fun with numba.
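
For reference, below is a rough numba sketch; it is not from the original answer, and it implements the open-interval counts from the question ((value, value+2) for count_a and (value-2, value) for count_b) rather than the exact isin logic of fun above:

import numpy as np
from numba import njit

@njit
def counts_numba(values):
    # values: 1-D integer array holding the 'value' column of one (district, region) group
    n = values.shape[0]
    out = np.zeros((n, 2), dtype=np.int64)   # columns: count_a, count_b
    for i in range(n):
        v = values[i]
        for j in range(n):
            if j == i:
                continue
            if v < values[j] and values[j] < v + 2:      # open interval (value, value+2)
                out[i, 0] += 1
            elif v - 2 < values[j] and values[j] < v:    # open interval (value-2, value)
                out[i, 1] += 1
    return out

# usage, assuming df with a default RangeIndex as in the snippets above:
# res = np.zeros((len(df), 2), dtype=np.int64)
# for _, g in df.groupby(['district', 'region']):
#     res[g.index.to_numpy()] = counts_numba(g['value'].to_numpy())
# df['count_a'], df['count_b'] = res[:, 0], res[:, 1]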

Answer 2 (score: -2)

Because of the GIL, multiprocessing does not actually use two different threads. In a CPU-bound process, using multiprocessing will not give you much, if any, extra performance.

There is a library called dask that is designed to look like pandas, but under the hood it does a lot of asynchronous and chunked operations for working with big data frames faster.