This is my current function. I'd like to know whether there is a way to vectorize it or otherwise make it more efficient, without iterating over the DataFrame with itertuples(). It isn't terribly slow at the moment, but there are more than 250,000 rows.
def function(dataframe, *actions):
    sources_list = []
    for dict_row in dataframe.itertuples(index=False):
        for entry in dict_row:
            temp_json_data = json.loads(entry)
            for dict_entry in temp_json_data:
                if dict_entry['action'] in actions:
                    sources_list.append(dict_entry)
    return sources_list
What this function does is iterate over every row of the DataFrame; dict_row is a pandas object and each entry in it is a string. We convert each string into a list of dictionaries with json.loads(), check whether each dictionary's 'action' value is one of the *actions arguments, and if so append that dictionary to the result list.
Here is a representative dataset:
actions
0 [{"E": 24, "action": "views"}, {"F": 22, "action": "noise"}, {"H": 39, "action": "conversions"}]
1 [{"B": 79, "action": "clicks"}, {"H": 3, "action": "conversions"}, {"G": 68, "action": "junk"}]
2 [{"E": 10, "action": "views"}, {"D": 41, "action": "views"}, {"J": 52, "action": "conversions"}]
3 [{"A": 47, "action": "clicks"}, {"E": 93, "action": "junk"}, {"D": 54, "action": "views"}]
4 [{"H": 16, "action": "views"}, {"G": 41, "action": "conversions"}, {"C": 80, "action": "junk"}]
5 [{"J": 57, "action": "noise"}, {"E": 93, "action": "views"}, {"H": 20, "action": "conversions"}]
6 [{"F": 5, "action": "junk"}, {"A": 11, "action": "junk"}, {"G": 98, "action": "junk"}]
7 [{"C": 36, "action": "junk"}, {"G": 38, "action": "clicks"}, {"D": 71, "action": "junk"}]
8 [{"A": 22, "action": "noise"}, {"C": 9, "action": "clicks"}, {"E": 94, "action": "conversions"}]
9 [{"E": 64, "action": "clicks"}, {"J": 80, "action": "junk"}, {"E": 77, "action": "conversions"}]
It can be recreated with the following snippet:
data = [["[{\"E\": 24, \"action\": \"views\"}, {\"F\": 22, \"action\": \"noise\"}, {\"H\": 39, \"action\": \"conversions\"}]"],
["[{\"B\": 79, \"action\": \"clicks\"}, {\"H\": 3, \"action\": \"conversions\"}, {\"G\": 68, \"action\": \"junk\" }]"],
["[{\"E\": 10, \"action\": \"views\"}, {\"D\": 41, \"action\": \"views\"}, {\"J\": 52, \"action\": \"conversions\"}]"],
["[{\"A\": 47, \"action\": \"clicks\"}, {\"E\": 93, \"action\": \"junk\"}, {\"D\": 54, \"action\": \"views\" }]"],
["[{\"H\": 16, \"action\": \"views\"}, {\"G\": 41, \"action\": \"conversions\"}, {\"C\": 80, \"action\": \"junk\" }]"],
["[{\"J\": 57, \"action\": \"noise\"}, {\"E\": 93, \"action\": \"views\"}, {\"H\": 20, \"action\": \"conversions\"}]"],
["[{\"F\": 5, \"action\": \"junk\"}, {\"A\": 11, \"action\": \"junk\"}, {\"G\": 98, \"action\": \"junk\" }]"],
["[{\"C\": 36, \"action\": \"junk\"}, {\"G\": 38, \"action\": \"clicks\"}, {\"D\": 71, \"action\": \"junk\" }]"],
["[{\"A\": 22, \"action\": \"noise\"}, {\"C\": 9, \"action\": \"clicks\"}, {\"E\": 94, \"action\": \"conversions\"}]"],
["[{\"E\": 64, \"action\": \"clicks\"}, {\"J\": 80, \"action\": \"junk\"}, {\"E\": 77, \"action\": \"conversions\"}]"]]
df = pd.DataFrame(data=data, columns=['actions'])
Answer (score: 0)
Pandas is not designed for storing iterables as values. You will get better performance by restructuring the data before it ever goes into the DataFrame.
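To illustrate that restructuring idea, here is a minimal sketch: parse the JSON strings once, before any DataFrame is built, and filter in the same pass. The `raw` rows and `wanted` set below are hypothetical stand-ins for the question's data, not part of the original code.

```python
import json

# Hypothetical raw input: each row is a JSON array string, as in the question.
raw = ['[{"E": 24, "action": "views"}, {"F": 22, "action": "noise"}]',
       '[{"B": 79, "action": "clicks"}, {"H": 3, "action": "conversions"}]']

wanted = {'views', 'clicks', 'conversions'}

# Parse once and filter once -- no DataFrame round trip needed for this task.
records = [d for row in raw for d in json.loads(row) if d['action'] in wanted]
```

A set is used for `wanted` so the membership test stays O(1) even with many action names.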
That said, all is not lost. You can parallelize the task with multiprocessing by splitting the DataFrame into chunks.
import json
import multiprocessing
from itertools import chain

import pandas as pd


def function(dataframe, *actions):
    sources_list = []
    for dict_row in dataframe.itertuples(index=False):
        for entry in dict_row:
            temp_json_data = json.loads(entry)
            for dict_entry in temp_json_data:
                if dict_entry['action'] in actions:
                    sources_list.append(dict_entry)
    return sources_list


class Parser:
    def __init__(self, dataframe, *actions):
        self.dataframe = dataframe
        self.actions = actions

    def helper(self, idx0, idxf):
        # .loc slicing is inclusive on both ends.
        result = []
        for datapoint in chain(*self.dataframe.loc[idx0:idxf, 'actions'].apply(json.loads)):
            if datapoint['action'] in self.actions:
                result.append(datapoint)
        return result

    def run(self, P=1):
        N = self.dataframe.shape[0]
        if P > 1:
            with multiprocessing.Pool(processes=P) as pool:
                n = N // P
                # The last chunk runs to the final row so nothing is dropped
                # when N is not evenly divisible by P.
                bounds = [(n * i, n * (i + 1) - 1 if i < P - 1 else N - 1)
                          for i in range(P)]
                results = pool.starmap(self.helper, bounds)
        else:
            results = [self.helper(0, N - 1)]
        return list(chain(*results))
data = [["[{\"E\": 24, \"action\": \"views\"}, {\"F\": 22, \"action\": \"noise\"}, {\"H\": 39, \"action\": \"conversions\"}]"],
["[{\"B\": 79, \"action\": \"clicks\"}, {\"H\": 3, \"action\": \"conversions\"}, {\"G\": 68, \"action\": \"junk\" }]"],
["[{\"E\": 10, \"action\": \"views\"}, {\"D\": 41, \"action\": \"views\"}, {\"J\": 52, \"action\": \"conversions\"}]"],
["[{\"A\": 47, \"action\": \"clicks\"}, {\"E\": 93, \"action\": \"junk\"}, {\"D\": 54, \"action\": \"views\" }]"],
["[{\"H\": 16, \"action\": \"views\"}, {\"G\": 41, \"action\": \"conversions\"}, {\"C\": 80, \"action\": \"junk\" }]"],
["[{\"J\": 57, \"action\": \"noise\"}, {\"E\": 93, \"action\": \"views\"}, {\"H\": 20, \"action\": \"conversions\"}]"],
["[{\"F\": 5, \"action\": \"junk\"}, {\"A\": 11, \"action\": \"junk\"}, {\"G\": 98, \"action\": \"junk\" }]"],
["[{\"C\": 36, \"action\": \"junk\"}, {\"G\": 38, \"action\": \"clicks\"}, {\"D\": 71, \"action\": \"junk\" }]"],
["[{\"A\": 22, \"action\": \"noise\"}, {\"C\": 9, \"action\": \"clicks\"}, {\"E\": 94, \"action\": \"conversions\"}]"],
["[{\"E\": 64, \"action\": \"clicks\"}, {\"J\": 80, \"action\": \"junk\"}, {\"E\": 77, \"action\": \"conversions\"}]"]]
actions = ['views', 'clicks', 'conversions']
df = pd.DataFrame(data=data*25000, columns=['actions'])
Note that I simulated a 250,000-row dataset by replicating the 10-row sample 25,000 times. (If you try multiprocessing on a small dataset, it may well perform worse; the gains only appear at scale.)
Even with a single process, this approach is slightly faster.
In [2]: %timeit function(df, *actions)
2.41 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit Parser(df, *actions).run(P=1)
2.1 s ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Depending on your CPU, you may be able to speed this up by 100% or more (here the runtime roughly halves at P=4). Spawning too many processes, however, saturates the hardware, as shown below.
In [4]: %timeit Parser(df, *actions).run(P=2)
1.67 s ± 6.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit Parser(df, *actions).run(P=4)
1.04 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit Parser(df, *actions).run(P=8)
1.06 s ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit Parser(df, *actions).run(P=16)
1.11 s ± 37.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit Parser(df, *actions).run(P=32)
1.34 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
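Given the diminishing returns above P=4 in these timings, a sensible default is to cap the worker count at the machine's core count. A small sketch of that heuristic (the helper name `pick_worker_count` is mine, not from the answer's code):

```python
import multiprocessing


def pick_worker_count(requested):
    # Never exceed the number of logical cores; beyond that the benchmarks
    # above show the extra processes only add scheduling overhead.
    return max(1, min(requested, multiprocessing.cpu_count()))


# e.g. Parser(df, *actions).run(P=pick_worker_count(32))
```

For CPU-bound work like JSON parsing, a value near `cpu_count()` is usually the sweet spot; oversubscribing only helps when workers spend time blocked on I/O.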