I'm running into a performance problem when trying to apply a function to a groupby object derived from a dataframe of roughly 150,000 rows.
For simplicity, let's work with a dummy dataframe a:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')
a
Out[24]:
                      0                   1
first second
bar   one      0.821371 2017-07-02 00:00:00
      one      0.312364 2017-07-02 00:05:00
      two      0.104821 2017-07-02 00:10:00
baz   one      0.839370 2017-07-02 00:15:00
      two      0.307262 2017-07-02 00:20:00
foo   one      0.719300 2017-07-02 00:25:00
      two      0.371118 2017-07-02 00:30:00
      two      0.765483 2017-07-02 00:35:00
qux   one      0.794236 2017-07-02 00:40:00
      two      0.571231 2017-07-02 00:45:00
I want to conditionally populate the last element of column 0 within each first-second group, following the logic described in this function:
def myfunc(g):
    if len(g) >= 2:  # if the group's length is greater than or equal to 2, then:
        if g.loc[g.index[-1], 0] > 0.5:  # if the last element of column 0 of the group > 0.5, then:
            # find the time difference between the last two records in column 1
            time_gap = g.loc[g.index[-1], 1] - g.loc[g.index[-2], 1]
            # and assign it to the last element in column 0 of that group
            g.loc[g.index[-1], 0] = time_gap
        else:
            # otherwise assign 'ELSE' to the last element of column 0 of the group
            g.loc[g.index[-1], 0] = 'ELSE'
    return g
Applying this function yields:
a.reset_index().groupby(['first', 'second']).apply(myfunc)
Out[23]:
  first second                0                   1
0   bar    one         0.821371 2017-07-02 00:00:00
1   bar    one             ELSE 2017-07-02 00:05:00   correct
2   bar    two         0.104821 2017-07-02 00:10:00
3   baz    one          0.83937 2017-07-02 00:15:00
4   baz    two         0.307262 2017-07-02 00:20:00
5   foo    one           0.7193 2017-07-02 00:25:00
6   foo    two         0.371118 2017-07-02 00:30:00
7   foo    two  0 days 00:05:00 2017-07-02 00:35:00   correct
8   qux    one         0.794236 2017-07-02 00:40:00
9   qux    two         0.571231 2017-07-02 00:45:00
The result above is exactly what I want. The problem is that this approach freezes my 16GB / i5-6200U CPU @ 2.3GHz machine when applied to my actual dataframe of roughly 150,000 rows.
What is the most efficient way to fill these elements conditionally? Do I (necessarily) need to write a function for it?
Note: I'm running this in a Jupyter notebook on Windows 10, in case that matters.
Answer 0 (Score: 4)
A few things are going on here. When you use apply with a groupby, you create a new dataframe for every group. We can improve performance by manipulating the groups' indices instead. First, take a copy of a, just in case; I don't want you to mess up your a... there it is:
a_ = a.copy()
OK, to speed things up:
g = a.groupby(level=['first', 'second'])
I'll use get_value and set_value with takeable=True. The takeable option lets me treat the row/column arguments as positional references, so I want to make sure my positions are correct:
j0 = a.columns.get_loc(0)
j1 = a.columns.get_loc(1)
Conveniently, g has an indices attribute that tells me the positions of all rows for each named group. I'll build a dictionary of group names to index arrays with a comprehension, keeping only the groups that clear the first hurdle of having length 2 or more:
g_ = {n: i for n, i in g.indices.items() if i.size > 1}
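To illustrate what that comprehension filters (a small self-contained sketch of my own, not from the answer): g.indices maps each group name to an array of integer row positions.

```python
import pandas as pd

# a tiny frame with a two-level index; ('bar', 'one') appears twice
arrays = [['bar', 'bar', 'bar', 'baz'], ['one', 'one', 'two', 'one']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
df = pd.DataFrame({0: [0.1, 0.2, 0.3, 0.4]}, index=index)

g = df.groupby(level=['first', 'second'])

# g.indices maps each group name to an array of integer row positions
positions = {name: idx.tolist() for name, idx in g.indices.items()}
print(positions)  # {('bar', 'one'): [0, 1], ('bar', 'two'): [2], ('baz', 'one'): [3]}

# keep only the groups with at least two rows
g_ = {n: i for n, i in g.indices.items() if i.size > 1}
print(list(g_))  # [('bar', 'one')]
```

These integer positions are exactly what the takeable (positional) lookups below consume.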
You're going to put values of different types into column 0, and since I'll be using set_value, I'd better cast that column to object ahead of time:
a[0] = a[0].astype(object)
Now I can loop over the groups that cleared the length hurdle:
for n, i in g_.items():
    i0, i1 = i[-2:]
    cond = a.get_value(i1, j0, takeable=True) > 0.5
    if cond:
        tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
        a.set_value(i1, j0, tgap, takeable=True)
    else:
        a.set_value(i1, j0, 'ELSE', takeable=True)
Putting it all together:
g = a.groupby(level=['first', 'second'])
j0 = a.columns.get_loc(0)
j1 = a.columns.get_loc(1)
g_ = {n: i for n, i in g.indices.items() if i.size > 1}
a[0] = a[0].astype(object)
for n, i in g_.items():
    i0, i1 = i[-2:]
    cond = a.get_value(i1, j0, takeable=True) > 0.5
    if cond:
        tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
        a.set_value(i1, j0, tgap, takeable=True)
    else:
        a.set_value(i1, j0, 'ELSE', takeable=True)
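Note that get_value and set_value (and their takeable option) were deprecated and later removed from pandas; in modern versions the same positional loop can be written with .iat. A sketch of the equivalent, using the fixed values from the question's printed output so the result is reproducible:

```python
import pandas as pd

# rebuild the question's frame with fixed values (taken from its printed output)
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
vals = [0.821371, 0.312364, 0.104821, 0.839370, 0.307262,
        0.719300, 0.371118, 0.765483, 0.794236, 0.571231]
a = pd.DataFrame({0: vals}, index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

g = a.groupby(level=['first', 'second'])
j0 = a.columns.get_loc(0)
j1 = a.columns.get_loc(1)
g_ = {n: i for n, i in g.indices.items() if i.size > 1}
a[0] = a[0].astype(object)

for n, i in g_.items():
    i0, i1 = i[-2:]
    if a.iat[i1, j0] > 0.5:                            # last value of column 0 in the group
        a.iat[i1, j0] = a.iat[i1, j1] - a.iat[i0, j1]  # time gap of the last two rows
    else:
        a.iat[i1, j0] = 'ELSE'

print(a.iat[1, j0], a.iat[7, j0])  # ELSE 0 days 00:05:00
```

The logic is unchanged; only the positional accessor differs.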
Timing:
%timeit a.reset_index().groupby(['first', 'second']).apply(myfunc)
100 loops, best of 3: 7.14 ms per loop
%%timeit
a = a_.copy()
g = a.groupby(level=['first', 'second'])
j0 = a.columns.get_loc(0)
j1 = a.columns.get_loc(1)
g_ = {n: i for n, i in g.indices.items() if i.size > 1}
a[0] = a[0].astype(object)
for n, i in g_.items():
    i0, i1 = i[-2:]
    cond = a.get_value(i1, j0, takeable=True) > 0.5
    if cond:
        tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
        a.set_value(i1, j0, tgap, takeable=True)
    else:
        a.set_value(i1, j0, 'ELSE', takeable=True)
1000 loops, best of 3: 1.01 ms per loop
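For what it's worth, the loop over groups can also be avoided entirely. The sketch below is my own addition (not part of either answer): it marks each group's last row with cumcount, checks group size with transform, and takes the time gap with a grouped diff; the data matches the question's dummy frame.

```python
import pandas as pd

# rebuild the question's dummy frame with fixed values for reproducibility
arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['first', 'second'])
vals = [0.821371, 0.312364, 0.104821, 0.839370, 0.307262,
        0.719300, 0.371118, 0.765483, 0.794236, 0.571231]
a = pd.DataFrame({0: vals}, index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

grp = a.groupby(level=['first', 'second'])
is_last = grp.cumcount(ascending=False) == 0   # last row of each group
in_big = grp[0].transform('size') >= 2         # group has at least two rows
tgap = grp[1].diff()                           # gap to the previous row within the group
gt = a[0] > 0.5                                # evaluate BEFORE overwriting column 0

a[0] = a[0].astype(object)
sel = is_last & in_big
# .to_numpy() sidesteps index alignment, which would fail on the duplicated
# ('foo', 'two') label
a.loc[sel & gt, 0] = tgap[sel & gt].to_numpy()
a.loc[sel & ~gt, 0] = 'ELSE'
```

Every step here is a single vectorized groupby operation, so it should scale much better than a Python-level loop over 150,000 rows; I have not benchmarked it against the answer's loop.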
Answer 1 (Score: 0)
I came up with a vectorized version that performs the transformation in a few steps. If you have a fairly complex transformation, I'd suggest trying to break it into steps that you can profile in pandas.
Your first step is to find the groups that have more than one row:
to_check = df.groupby(['first','second']).size() > 1
first  second
bar    one        True
       two       False
baz    one       False
       two       False
foo    one       False
       two        True
qux    one       False
       two       False
dtype: bool
Now we can merge the result (on the index) back into the original df:
df = df.merge(to_check.rename('check').to_frame(), left_index=True, right_index=True)
                     one                 two  check
first second
bar   one       0.821371 2017-07-02 00:00:00   True
      one       0.312364 2017-07-02 00:05:00   True
      two       0.104821 2017-07-02 00:10:00  False
baz   one       0.839370 2017-07-02 00:15:00  False
      two       0.307262 2017-07-02 00:20:00  False
foo   one       0.719300 2017-07-02 00:25:00  False
      two       0.371118 2017-07-02 00:30:00   True
      two       0.765483 2017-07-02 00:35:00   True
qux   one       0.794236 2017-07-02 00:40:00  False
      two       0.571231 2017-07-02 00:45:00  False
Now we can use the check column to find the groups we're interested in and see whether we should do the calculation:
to_calc = df.loc[df['check'], 'one'].groupby(['first','second']).apply(lambda x: 1 if x[-1] > 0.5 else 'Else')
first  second
bar    one     Else
foo    two        1
dtype: object
Now that we have the results, we can merge them back into df:
df = df.merge(to_calc.to_frame(), left_index=True, right_index=True, how='outer')
                     one                 two  check to_calc
first second
bar   one       0.821371 2017-07-02 00:00:00   True    Else
      one       0.312364 2017-07-02 00:05:00   True    Else
      two       0.104821 2017-07-02 00:10:00  False     NaN
baz   one       0.839370 2017-07-02 00:15:00  False     NaN
      two       0.307262 2017-07-02 00:20:00  False     NaN
foo   one       0.719300 2017-07-02 00:25:00  False     NaN
      two       0.371118 2017-07-02 00:30:00   True       1
      two       0.765483 2017-07-02 00:35:00   True       1
qux   one       0.794236 2017-07-02 00:40:00  False     NaN
      two       0.571231 2017-07-02 00:45:00  False     NaN
which we can use to perform the actual calculation:
df_calc = df.loc[df['to_calc']==1,'two'].groupby(level=['first','second']).apply(lambda x: x[-1]-x[-2])
df = df.merge(df_calc.rename('calc').to_frame(), left_index=True, right_index=True, how='outer')
                     one                 two  check to_calc     calc
first second
bar   one       0.821371 2017-07-02 00:00:00   True    Else      NaT
      one       0.312364 2017-07-02 00:05:00   True    Else      NaT
      two       0.104821 2017-07-02 00:10:00  False     NaN      NaT
baz   one       0.839370 2017-07-02 00:15:00  False     NaN      NaT
      two       0.307262 2017-07-02 00:20:00  False     NaN      NaT
foo   one       0.719300 2017-07-02 00:25:00  False     NaN      NaT
      two       0.371118 2017-07-02 00:30:00   True       1 00:05:00
      two       0.765483 2017-07-02 00:35:00   True       1 00:05:00
qux   one       0.794236 2017-07-02 00:40:00  False     NaN      NaT
      two       0.571231 2017-07-02 00:45:00  False     NaN      NaT
Writing the results only into the last row of each group is now fairly tricky, but it can be done by combining a pandas null check with a duplicate test:
calc_select = (~pd.isnull(df['calc'])) & ~(
    df.duplicated(subset=['first','second'], keep='last'))
else_select = (df['to_calc'] == 'Else') & ~(
    df.duplicated(subset=['first','second'], keep='last'))
df.loc[else_select, 'one'] = df.loc[else_select, 'to_calc']
df.loc[calc_select, 'one'] = df.loc[calc_select, 'calc'].astype('O')
  first second              one                 two  check to_calc     calc
0   bar    one         0.821371 2017-07-02 00:00:00   True    Else      NaT
1   bar    one             Else 2017-07-02 00:05:00   True    Else      NaT
2   bar    two         0.104821 2017-07-02 00:10:00  False     NaN      NaT
3   baz    one          0.83937 2017-07-02 00:15:00  False     NaN      NaT
4   baz    two         0.307262 2017-07-02 00:20:00  False     NaN      NaT
5   foo    one           0.7193 2017-07-02 00:25:00  False     NaN      NaT
6   foo    two         0.371118 2017-07-02 00:30:00   True       1 00:05:00
7   foo    two  0 days 00:05:00 2017-07-02 00:35:00   True       1 00:05:00
8   qux    one         0.794236 2017-07-02 00:40:00  False     NaN      NaT
9   qux    two         0.571231 2017-07-02 00:45:00  False     NaN      NaT
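To see why the duplicated(..., keep='last') trick selects each group's last row, here is a standalone toy example (my own, not part of the answer's data):

```python
import pandas as pd

df = pd.DataFrame({'first':  ['bar', 'bar', 'foo', 'foo', 'foo'],
                   'second': ['one', 'one', 'one', 'two', 'two'],
                   'val':    [1, 2, 3, 4, 5]})

# duplicated(keep='last') flags every row EXCEPT the last occurrence of each
# ('first', 'second') pair, so negating it keeps exactly one row per group
last_rows = ~df.duplicated(subset=['first', 'second'], keep='last')
print(df.loc[last_rows, 'val'].tolist())  # [2, 3, 5]
```

Combined with the null check on calc, this restricts the assignment to the tail row of each qualifying group.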
On the test data my solution actually runs slower than yours (18.4 ms vs your 7.08 ms), but I'd assume the vectorized solution scales better on a larger dataset. I'd be really interested if you could post timings for your dataset.