As the title says, I have a big dataframe (df) that needs to be processed row by row. Because df is large (6 GB), I want to use Python's multiprocessing package to speed this up. Below is a toy example; given my writing skills and the complexity of the task, I will briefly describe what I want to achieve and then lay out the details in code.

The raw data is df, on which I want to perform some row-wise analysis (the order does not matter) that needs not only the focal row itself but also other rows satisfying certain conditions. Here are the toy data and my code:

import pandas as pd
import numpy as np
import itertools
from multiprocessing import Pool
import time
import math
# a test example
start_time = time.time()
df = pd.DataFrame({'value': np.random.randint(0, 10, size=30),
'district': (['upper'] * 5 + ['down'] * 5) * 3,
'region': ['A'] * 10 + ['B'] * 10 + ['C'] * 10})
df['row_id'] = df.index
print(df)
value district region row_id
0 8 upper A 0
1 4 upper A 1
2 0 upper A 2
3 3 upper A 3
4 0 upper A 4
5 0 down A 5
6 3 down A 6
7 7 down A 7
8 1 down A 8
9 7 down A 9
10 7 upper B 10
11 3 upper B 11
12 9 upper B 12
13 8 upper B 13
14 2 upper B 14
15 4 down B 15
16 5 down B 16
17 3 down B 17
18 5 down B 18
19 3 down B 19
20 3 upper C 20
21 1 upper C 21
22 3 upper C 22
23 0 upper C 23
24 3 upper C 24
25 2 down C 25
26 0 down C 26
27 1 down C 27
28 1 down C 28
29 0 down C 29
What I want to do is to add two more columns, count_a and count_b: count_a counts the rows whose values fall within (value, value+2), and count_b counts the rows whose values fall within (value-2, value), both restricted to the same region and district subset. For example, count_b for the row with row_id == 0 should be 0, because among the rows with region == 'A' and district == 'upper' no value is 7, the only value that falls within (8-2, 8). So the desired output should be:

    count_a count_b region row_id
0 0 0 A 0
1 0 1 A 1
2 0 0 A 2
3 1 0 A 3
4 0 0 A 4
5 1 0 A 5
6 0 0 A 6
7 0 0 A 7
8 0 1 A 8
9 0 0 A 9
10 1 0 B 10
11 0 1 B 11
12 0 1 B 12
13 1 1 B 13
14 1 0 B 14
15 2 2 B 15
16 0 1 B 16
17 1 0 B 17
18 0 1 B 18
19 1 0 B 19
20 0 0 C 20
21 0 1 C 21
22 0 0 C 22
23 1 0 C 23
24 0 0 C 24
25 0 2 C 25
26 2 0 C 26
27 1 2 C 27
28 1 2 C 28
29 2 0 C 29
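To double-check the rule on the toy data, here is a minimal sketch that computes the two counts for a single focal row (just an illustration, not one of the solutions discussed below; since df is built without a random seed, the actual numbers differ between runs):

focal = df.loc[0]                                             # the row with row_id == 0
same_group = df[(df['region'] == focal['region']) &
                (df['district'] == focal['district'])]
# strict inequalities, so the focal row itself is never counted
count_a = ((same_group['value'] > focal['value']) &
           (same_group['value'] < focal['value'] + 2)).sum()  # rows in (value, value+2)
count_b = ((same_group['value'] > focal['value'] - 2) &
           (same_group['value'] < focal['value'])).sum()      # rows in (value-2, value)
print(count_a, count_b)                                       # 0 0 for the printed values above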
Question 1: Can such a task be vectorized?

Question 2: How can we use multiprocessing to speed it up (solved)?

I decided to go with multiprocessing because I was not sure how to accomplish this through vectorization. The solution (based on the answer provided) is:
def b_a(input_df, r_d):
    print('length of input dataframe: ' + str(len(input_df)))
    # print('region: ' + str(r_d[0]), 'district: ' + str(r_d[1]))
    # subset the rows belonging to the given (region, district) pair
    sub_df = input_df.loc[(input_df['region'].isin([r_d[0]])) & (input_df['district'].isin([r_d[1]]))]
    print('length of sliced dataframe: ' + str(len(sub_df)))
    print(r_d[0], r_d[1])
    b_a = pd.DataFrame(columns=['count_a', 'count_b', 'row_id', 'region'])
    for id in sub_df['row_id']:
        print('processing row: ' + str(id))
        focal_value = sub_df.loc[sub_df['row_id'].isin([id])]['value']
        # rows whose value falls strictly inside (focal_value - 2, focal_value)
        temp_b = sub_df.loc[
            (sub_df['value'] > (focal_value - 2).values[0]) & (sub_df['value'] < (focal_value.values[0]))]
        # rows whose value falls strictly inside (focal_value, focal_value + 2)
        temp_a = sub_df.loc[
            (sub_df['value'] > (focal_value.values[0])) & (sub_df['value'] < (focal_value + 2).values[0])]
        if len(temp_a):
            temp_a['count_a'] = temp_a['row_id'].count()
        else:
            temp_a = temp_a.append(pd.Series(), ignore_index=True)
            temp_a = temp_a.reindex(
                columns=[*temp_a.columns.tolist(), 'count_a'], fill_value=0)
        print(temp_a)
        if len(temp_b):
            temp_b['count_b'] = temp_b['row_id'].count()
        else:
            temp_b = temp_b.append(pd.Series(), ignore_index=True)
            temp_b = temp_b.reindex(
                columns=[*temp_b.columns.tolist(), 'count_b'], fill_value=0)
        print(len(temp_a), len(temp_b))
        temp_b.drop_duplicates('count_b', inplace=True)
        temp_a.drop_duplicates('count_a', inplace=True)
        temp = pd.concat([temp_b[['count_b']].reset_index(drop=True),
                          temp_a[['count_a']].reset_index(drop=True)], axis=1)
        temp['row_id'] = id
        temp['region'] = str(r_d[0])
        b_a = pd.concat([b_a, temp])
    return b_a

r_d_list = list(itertools.product(df['region'].unique(), df['district'].unique()))

if __name__ == '__main__':
    P = Pool(3)
    out = P.starmap(b_a, zip([chunks[r_d_list.index(j)] for j in r_d_list for i in range(len(j))],
                             list(itertools.chain.from_iterable(r_d_list))))  # S3
    # out = P.starmap(b_a, zip([df for i in range(len(r_d_list))], r_d_list))  # S2
    # out = P.starmap(b_a, zip(df, r_d_list))  # S1
    # print(out)
    P.close()
    P.join()
    final = pd.concat(out, ignore_index=True)
    print(final)
    final.to_csv('final.csv', index=False)
    print("--- %s seconds ---" % (time.time() - start_time))
Since P.starmap (as well as P.map) needs the function to be fed with every possible pair of arguments for b_a, solution S1 does not work: zip(df, r_d_list) actually zips the column names of df with the elements of r_d_list, which then raises AttributeError: 'str' object has no attribute 'loc', because the input_df received by b_a is literally a string (a column name of df). This can be verified from the output of print('length of input dataframe: ' + str(len(input_df))), which prints the length of a column-name string rather than the number of rows of df.
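A quick check makes this visible (using the toy df and the flat r_d_list defined in the code above; iterating over a DataFrame yields its column names, and that is what zip consumes):

print(list(df))
# ['value', 'district', 'region', 'row_id']
print(list(zip(df, r_d_list))[:2])
# [('value', ('A', 'upper')), ('district', ('A', 'down'))]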
The accepted answer corrects this by creating an array of references to df (I am not sure what exactly it is) with the same length as the argument list r_d_list, i.e. solution S2. This solution works great, but it could be slow when df is large because, as far as I understand, it searches the entire dataframe for every pair of arguments (region and district). So I came up with a modified version that splits the data into chunks based on region and district and then searches within each chunk rather than the whole dataframe (S3). For me this improves the runtime by about 20%. See the code below:
region = df['region'].unique()
chunk_numbers = 3
chunk_region = math.ceil(len(region) / chunk_numbers)
chunks = list()
r_d_list = list()
row_count = 0
for i in range(chunk_numbers):
    print(i)
    if i < chunk_numbers - 1:
        regions = region[(i * chunk_region):((i + 1) * chunk_region)]
        temp = df.loc[df['region'].isin(regions.tolist())]
        chunks.append(temp)
        r_d_list.append(list(itertools.product(regions, temp['district'].unique())))
        del temp
    else:
        regions = region[(i * chunk_region):len(region)]
        temp = df.loc[df['region'].isin(regions.tolist())]
        chunks.append(temp)
        r_d_list.append(list(itertools.product(regions, temp['district'].unique())))
        del temp
    row_count = row_count + len(chunks[i])
    print(row_count)
Insert this block between print(df) and the definition of b_a, and remember to comment out the original r_d_list = list(itertools.product(df['region'].unique(), df['district'].unique())) line further down, since the chunked version builds its own nested r_d_list.
Thanks to this wonderful community, I now have a working solution. I have updated my question to provide some material for people who may run into the same problem in the future, and to formulate the problem better in the hope of even better solutions.
Answer 0 (score: 0)
Change

out = P.starmap(b_a,zip(df,r_d_list))

into

out = P.starmap(b_a, zip([df for i in range(len(r_d_list))], r_d_list))

The output then looks like this:
length of input dataframe: 300
region: B district: down
length of input dataframe: 300
region: C district: upper
length of sliced dataframe: 50
length of input dataframe: 300
region: C district: down
length of sliced dataframe: 50
length of sliced dataframe: 50
6
[ count_a count_b region row_id
0 6 7 A 0, count_a count_b region row_id
0 2 4 A 50, count_a count_b region row_id
0 1 4 B 100, count_a count_b region row_id
0 7 4 B 150, count_a count_b region row_id
0 4 9 C 200, count_a count_b region row_id
0 4 4 C 250]
The dfa array maintains references to df:
dfa = [df for i in range(len(r_d_list))]
for i in dfa:
    print(['id(i): ', id(i)])
The output of the above is:
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
['id(i): ', 4427699200]
The difference between zip(df, r_d_list) and zip(dfa, r_d_list):

See the example for zip at https://docs.python.org/3.3/library/functions.html#zip to understand what zip does and how it constructs its result.

list(zip(df, r_d_list)) returns the following:
[
('value', ('A', 'upper')),
('district', ('A', 'down')),
('region', ('B', 'upper')),
('row_id', ('B', 'down'))
]
list(zip(dfa, r_d_list)) returns the following:
[
(df, ('A', 'upper')),
(df, ('A', 'down')),
(df, ('B', 'upper')),
(df, ('B', 'down'))
]
You can find more examples of pool.starmap in the question "Python multiprocessing pool.map for multiple arguments".
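For reference, a tiny standalone example of how starmap unpacks argument tuples (a toy illustration, not taken from the linked post):

from multiprocessing import Pool

def add(x, y):
    return x + y

if __name__ == '__main__':
    with Pool(2) as p:
        print(p.starmap(add, [(1, 2), (3, 4)]))  # -> [3, 7]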
Updated working code:
import pandas as pd
import numpy as np
import itertools
from multiprocessing import Pool

df = pd.DataFrame({'value': np.random.randint(0, 10, size=300),
                   'district': (['upper'] * 50 + ['down'] * 50) * 3,
                   'region': ['A'] * 100 + ['B'] * 100 + ['C'] * 100})

df['row_id'] = df.index

# b_a = pd.DataFrame(columns=['count_a', 'count_b', 'row_id', 'region'])

# solution 2: multi processing
def b_a(input_df, r_d):
    # print('length of input dataframe: ' + str(len(input_df)))
    # print('region: ' + str(r_d[0]), 'district: ' + str(r_d[1]))
    sub_df = input_df.loc[(input_df['region'].isin([r_d[0]])) & (input_df['district'].isin([r_d[1]]))]  # subset data that in certain region and district
    # print('length of sliced dataframe: ' + str(len(sub_df)))
    b_a = pd.DataFrame(columns=['count_a', 'count_b', 'row_id', 'region'])  # an empty data frame to store result
    for id in sub_df['row_id']:
        focal_value = sub_df.loc[sub_df['row_id'].isin([id])]['value']
        temp_b = sub_df.loc[
            (sub_df['value'] > (focal_value - 2).values[0]) & (sub_df['value'] < (focal_value.values[0]))]
        temp_a = sub_df.loc[
            (sub_df['value'] > (focal_value.values[0])) & (sub_df['value'] < (focal_value + 2).values[0])]
        if len(temp_a):
            temp_a['count_a'] = temp_a['row_id'].count()
        else:
            temp_a = temp_a.reindex(
                columns=[*temp_a.columns.tolist(), 'count_a'], fill_value=0)
        if len(temp_b):
            temp_b['count_b'] = temp_b['row_id'].count()
        else:
            temp_b = temp_b.reindex(
                columns=[*temp_b.columns.tolist(), 'count_b'], fill_value=0)
        temp_b.drop_duplicates('count_b', inplace=True)
        temp_a.drop_duplicates('count_a', inplace=True)
        temp = pd.concat([temp_b[['count_b']].reset_index(drop=True),
                          temp_a[['count_a']].reset_index(drop=True)], axis=1)
        temp['row_id'] = id
        temp['region'] = str(r_d[0])
        b_a = pd.concat([b_a, temp])
    return b_a

r_d_list = list(itertools.product(df['region'].unique(), df['district'].unique()))

# dfa = [df for i in range(len(r_d_list))]
# for i in dfa:
#     print(['id(i): ', id(i)])

if __name__ == '__main__':
    P = Pool(3)
    out = P.starmap(b_a, zip([df for i in range(len(r_d_list))], r_d_list))
    # print(len(out))
    P.close()
    P.join()
    final = pd.concat(out, ignore_index=True)
    print(final)
Output of final:
count_a count_b region row_id
0 4 6 A 0
1 5 4 A 1
2 NaN 5 A 2
3 5 8 A 3
4 5 NaN A 4
.. ... ... ... ...
295 2 7 C 295
296 6 NaN C 296
297 6 6 C 297
298 5 5 C 298
299 6 6 C 299
[300 rows x 4 columns]
Answer 1 (score: 0)
I think there is still room for improvement here. I suggest you do the per-row counting group by group with groupby:
import os
import pandas as pd
import numpy as np
import dask.dataframe as dd
N = 30_000
# Now the example is reproducible
np.random.seed(0)
df = pd.DataFrame({'value': np.random.randint(0, 10, size=N),
'district': (['upper'] * 5 + ['down'] * 5) * 3000,
'region': ['A'] * 10_000 + ['B'] * 10_000 + ['C'] * 10_000,
'row_id': np.arange(N)})
The following function returns count_a and count_b for every row within a given group:
def fun(vec):
    # vec: list with the 'value' column of one district/region group
    out = []
    for i, v in enumerate(vec):
        a = vec[:i] + vec[i+1:]               # all the other values in the group
        count_a = np.isin(a, [v-2, v]).sum()
        count_b = np.isin(a, [v, v+2]).sum()
        out.append([count_a, count_b])
    return out
%%time
df[["count_a", "count_b"]] = df.groupby(["district", "region"])["value"]\
.apply(lambda x: fun(x.tolist()))\
.explode().apply(pd.Series)\
.reset_index(drop=True)
CPU times: user 22.6 s, sys: 174 ms, total: 22.8 s
Wall time: 22.8 s
Now you need to create df again, and then you can use dask. This is just the first thing that came to my mind; there are certainly better/faster ways.
ddf = dd.from_pandas(df, npartitions=os.cpu_count())
df[["count_a", "count_b"]] = ddf.groupby(["district", "region"])["value"]\
.apply(lambda x: fun(x.tolist()),
meta=('x', 'f8'))\
.compute(scheduler='processes')\
.explode().apply(pd.Series)\
.reset_index(drop=True)
CPU times: user 6.92 s, sys: 114 ms, total: 7.04 s
Wall time: 13.4 s
In this case you again need to create df. The trick here is to split df into a list lst of dataframes and hand those to multiprocessing:
import multiprocessing as mp

def parallelize(fun, vec, cores):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

def par_fun(d):
    d = d.reset_index(drop=True)
    o = pd.DataFrame(fun(d["value"].tolist()),
                     columns=["count_a", "count_b"])
    return pd.concat([d, o], axis=1)
%%time
lst = [l[1] for l in list(df.groupby(["district", "region"]))]
out = parallelize(par_fun, lst, os.cpu_count())
out = pd.concat(out, ignore_index=True)
CPU times: user 152 ms, sys: 49.7 ms, total: 202 ms
Wall time: 5 s
Finally, you could improve the function fun with numba.
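As a rough sketch of that idea (my own assumption of what a numba version could look like, not code from this answer; it implements the strict windows (value-2, value) and (value, value+2) from the question on a NumPy array):

import numba
import numpy as np

@numba.njit
def fun_nb(vec):
    # vec: 1-D array with the 'value' column of one district/region group
    n = vec.shape[0]
    out = np.empty((n, 2), dtype=np.int64)
    for i in range(n):
        v = vec[i]
        count_a = 0
        count_b = 0
        for j in range(n):
            if j == i:
                continue
            w = vec[j]
            if w > v and w < v + 2:        # strictly inside (value, value+2)
                count_a += 1
            elif w > v - 2 and w < v:      # strictly inside (value-2, value)
                count_b += 1
        out[i, 0] = count_a
        out[i, 1] = count_b
    return out

Inside par_fun you could then call fun_nb(d["value"].to_numpy()) instead of fun(d["value"].tolist()).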
Answer 2 (score: -2)