Efficient way to fill elements in a pandas groupby object (possibly by applying a function)

Asked: 2017-07-06 16:33:32

Tags: python performance pandas numpy vectorization

I'm running into a performance problem when applying a function to a groupby object derived from a DataFrame with roughly 150,000 rows.

For simplicity, let's work with a dummy DataFrame a

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')


a
Out[24]: 
                     0                   1
first second                              
bar   one     0.821371 2017-07-02 00:00:00
      one     0.312364 2017-07-02 00:05:00
      two     0.104821 2017-07-02 00:10:00
baz   one     0.839370 2017-07-02 00:15:00
      two     0.307262 2017-07-02 00:20:00
foo   one     0.719300 2017-07-02 00:25:00
      two     0.371118 2017-07-02 00:30:00
      two     0.765483 2017-07-02 00:35:00
qux   one     0.794236 2017-07-02 00:40:00
      two     0.571231 2017-07-02 00:45:00

I would like to conditionally fill the last element of column 0 within each first-second group, following the logic described in this function

def myfunc(g):

    if len(g) >= 2:  # if the group has at least two rows, then:

        if g.loc[g.index[-1], 0] > 0.5:  # if the last element of column 0 in the group is > 0.5, then:

            # find the time difference between the last two records in column 1
            time_gap = g.loc[g.index[-1], 1] - g.loc[g.index[-2], 1]

            # and assign it to the last element of column 0 in that group
            g.loc[g.index[-1], 0] = time_gap

        else:
            # otherwise assign 'ELSE' to the last element of column 0 in the group
            g.loc[g.index[-1], 0] = 'ELSE'

    return g

Applying this function gives

a.reset_index().groupby(['first', 'second']).apply(myfunc)
Out[23]: 
  first second                0                   1
0   bar    one         0.821371 2017-07-02 00:00:00
1   bar    one             ELSE 2017-07-02 00:05:00  correct
2   bar    two         0.104821 2017-07-02 00:10:00
3   baz    one          0.83937 2017-07-02 00:15:00
4   baz    two         0.307262 2017-07-02 00:20:00
5   foo    one           0.7193 2017-07-02 00:25:00
6   foo    two         0.371118 2017-07-02 00:30:00
7   foo    two  0 days 00:05:00 2017-07-02 00:35:00  correct
8   qux    one         0.794236 2017-07-02 00:40:00
9   qux    two         0.571231 2017-07-02 00:45:00

The result above is exactly what I want. The problem is that this approach freezes my machine (16 GB RAM, i5-6200U CPU @ 2.3 GHz) when applied to my real DataFrame of roughly 150,000 rows.

What is the most efficient way to fill these elements conditionally? Do I even need to write a function?

Note: I'm running this in a Jupyter notebook on Windows 10, in case that matters.

2 Answers:

Answer 0 (score: 4)

There are several things going on here.

  1. You are editing the DataFrame from within the groupby apply. That is asking for a lot of debugging down the line.
  2. When you use apply with a groupby, you create a new DataFrame for every group. We can improve performance by manipulating the groups' indices instead.
  3. You don't need to reset the index in order to group by the index levels.

    First, make a copy of a in case something gets lost along the way; I don't want you messing up a... here it is.

    a_ = a.copy()
    

    Okay, to speed things up

    g = a.groupby(level=['first', 'second'])
    

    I'm going to use get_value and set_value with takeable=True. The takeable option lets me pass the other arguments as positional references, so I want to make sure I get my positions right.

    j0 = a.columns.get_loc(0)
    j1 = a.columns.get_loc(1)
    

    Conveniently, g has an indices attribute that tells me the positions of all the rows in each named group. I'll build a dictionary of names and indices with a comprehension, keeping only the groups that clear the first hurdle of having length 2 or more.

    g_ = {n: i for n, i in g.indices.items() if i.size > 1}
    
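    For reference (values shown are illustrative, not part of the original answer), on the dummy frame the filtered dictionary keeps only the two groups with more than one row, mapping each group name to an array of row positions:

    print(g_)
    # roughly: {('bar', 'one'): array([0, 1]), ('foo', 'two'): array([6, 7])}
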

    You are putting things of different types into column 0, and since I'm going to use set_value, I'd better cast that column to object ahead of time.

    a[0] = a[0].astype(object)
    

    Now I can loop over the groups that passed the length hurdle above.

    for n, i in g_.items():
        i0, i1 = i[-2:]
        cond = a.get_value(i1, j0, takeable=True) > 0.5
        if cond:
            tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
            a.set_value(i1, j0, tgap, takeable=True)
        else:
            a.set_value(i1, j0, 'ELSE', takeable=True)
    

    All together

    g = a.groupby(level=['first', 'second'])
    
    j0 = a.columns.get_loc(0)
    j1 = a.columns.get_loc(1)
    g_ = {n: i for n, i in g.indices.items() if i.size > 1}
    
    a[0] = a[0].astype(object)
    
    for n, i in g_.items():
        i0, i1 = i[-2:]
        cond = a.get_value(i1, j0, takeable=True) > 0.5
        if cond:
            tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
            a.set_value(i1, j0, tgap, takeable=True)
        else:
            a.set_value(i1, j0, 'ELSE', takeable=True)
    

    Timing

    %timeit a.reset_index().groupby(['first', 'second']).apply(myfunc)
    100 loops, best of 3: 7.14 ms per loop
    
    %%timeit
    a = b.copy()
    g = a.groupby(level=['first', 'second'])
    
    j0 = a.columns.get_loc(0)
    j1 = a.columns.get_loc(1)
    g_ = {n: i for n, i in g.indices.items() if i.size > 1}
    
    a[0] = a[0].astype(object)
    
    for n, i in g_.items():
        i0, i1 = i[-2:]
        cond = a.get_value(i1, j0, takeable=True) > 0.5
        if cond:
            tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
            a.set_value(i1, j0, tgap, takeable=True)
        else:
            a.set_value(i1, j0, 'ELSE', takeable=True)
    
    1000 loops, best of 3: 1.01 ms per loop
    
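    Aside (not part of the original answer): get_value and set_value were later deprecated and eventually removed from pandas. Assuming a current pandas release, the same positional loop can be sketched with .iat instead:

    # hedged adaptation for newer pandas, where get_value/set_value no longer exist;
    # .iat gives the same scalar access by integer position
    for n, i in g_.items():
        i0, i1 = i[-2:]
        if a.iat[i1, j0] > 0.5:
            # time gap between the last two records of the group
            a.iat[i1, j0] = a.iat[i1, j1] - a.iat[i0, j1]
        else:
            a.iat[i1, j0] = 'ELSE'
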

Answer 1 (score: 0)

I came up with a vectorized version that performs the transformation in several steps. If you have a fairly complex transformation, I'd suggest trying to break it down into steps that you can vectorize in pandas.

The first step is to find the groups that have more than one member:

to_check = df.groupby(['first','second']).size().apply(lambda x: True if x > 1 else False)
first  second
bar    one        True
       two       False
baz    one       False
       two       False
foo    one       False
       two        True
qux    one       False
       two       False
dtype: bool
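
(Side note, not from the original answer: the lambda isn't strictly needed here; comparing the group sizes directly yields the same boolean Series.)

to_check = df.groupby(['first', 'second']).size() > 1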

Now we can merge the result back into the original df (on the index):

df = df.merge(to_check.rename('check').to_frame(), left_index=True, right_index=True)
                   one                 two  check
first second                                     
bar   one     0.821371 2017-07-02 00:00:00   True
      one     0.312364 2017-07-02 00:05:00   True
      two     0.104821 2017-07-02 00:10:00  False
baz   one     0.839370 2017-07-02 00:15:00  False
      two     0.307262 2017-07-02 00:20:00  False
foo   one     0.719300 2017-07-02 00:25:00  False
      two     0.371118 2017-07-02 00:30:00   True
      two     0.765483 2017-07-02 00:35:00   True
qux   one     0.794236 2017-07-02 00:40:00  False
      two     0.571231 2017-07-02 00:45:00  False

Now we can use the 'check' column to pick out the groups we are interested in and see whether we should do the calculation:

to_calc = df.loc[df['check'], 'one'].groupby(['first','second']).apply(lambda x: 1 if x[-1] > 0.5 else 'Else')
first  second
bar    one           Else
foo    two       0.394365

Now that we have these results, we can merge them into df:

df = df.merge(to_calc.to_frame(), left_index=True, right_index=True, how='outer')
                   one                 two  check to_calc
first second                                             
bar   one     0.821371 2017-07-02 00:00:00   True    Else
      one     0.312364 2017-07-02 00:05:00   True    Else
      two     0.104821 2017-07-02 00:10:00  False     NaN
baz   one     0.839370 2017-07-02 00:15:00  False     NaN
      two     0.307262 2017-07-02 00:20:00  False     NaN
foo   one     0.719300 2017-07-02 00:25:00  False     NaN
      two     0.371118 2017-07-02 00:30:00   True       1
      two     0.765483 2017-07-02 00:35:00   True       1
qux   one     0.794236 2017-07-02 00:40:00  False     NaN
      two     0.571231 2017-07-02 00:45:00  False     NaN

which we can then use to perform the actual calculation:

df_calc = df.loc[df['to_calc']==1,'two'].groupby(level=['first','second']).apply(lambda x: x[-1]-x[-2])
df = df.merge(df_calc.rename('calc').to_frame(), left_index=True, right_index=True, how='outer')
                   one                 two  check to_calc     calc
first second                                                      
bar   one     0.821371 2017-07-02 00:00:00   True    Else      NaT
      one     0.312364 2017-07-02 00:05:00   True    Else      NaT
      two     0.104821 2017-07-02 00:10:00  False     NaN      NaT
baz   one     0.839370 2017-07-02 00:15:00  False     NaN      NaT
      two     0.307262 2017-07-02 00:20:00  False     NaN      NaT
foo   one     0.719300 2017-07-02 00:25:00  False     NaN      NaT
      two     0.371118 2017-07-02 00:30:00   True       1 00:05:00
      two     0.765483 2017-07-02 00:35:00   True       1 00:05:00
qux   one     0.794236 2017-07-02 00:40:00  False     NaN      NaT
      two     0.571231 2017-07-02 00:45:00  False     NaN      NaT

Writing the results into only the last row of each group is somewhat tricky now, but it can be done by combining a pandas null test with a duplicate test:

calc_select = (~pd.isnull(df['calc']))&~(
       df.duplicated(subset=['first','second'],keep='last'))
else_select = (df['to_calc']=='Else')&~(
       df.duplicated(subset=['first','second'],keep='last'))
df.loc[else_select, 'one'] = df.loc[else_select, 'to_calc']
df.loc[calc_select, 'one'] = df.loc[calc_select, 'calc'].astype('O')

first second              one                 two  check to_calc     calc
0   bar    one         0.821371 2017-07-02 00:00:00   True    Else      NaT
1   bar    one             Else 2017-07-02 00:05:00   True    Else      NaT
2   bar    two         0.104821 2017-07-02 00:10:00  False     NaN      NaT
3   baz    one          0.83937 2017-07-02 00:15:00  False     NaN      NaT
4   baz    two         0.307262 2017-07-02 00:20:00  False     NaN      NaT
5   foo    one           0.7193 2017-07-02 00:25:00  False     NaN      NaT
6   foo    two         0.371118 2017-07-02 00:30:00   True       1 00:05:00
7   foo    two  0 days 00:05:00 2017-07-02 00:35:00   True       1 00:05:00
8   qux    one         0.794236 2017-07-02 00:40:00  False     NaN      NaT
9   qux    two         0.571231 2017-07-02 00:45:00  False     NaN      NaT

On the test data my solution actually runs slower than yours (18.4 ms versus your 7.08 ms), but I would expect the vectorized solution to scale better on a larger dataset. I'd be genuinely interested if you could post timings for your dataset.