Efficient way to fill elements in a pandas groupby object (possibly by applying a function)

Asked: 2017-07-06 16:33:32

Tags: python performance pandas numpy vectorization

I'm running into a performance problem when applying a function to a groupby object derived from a DataFrame with roughly 150,000 rows.

For simplicity, let's work with a dummy DataFrame a

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')


a
Out[24]: 
                     0                   1
first second                              
bar   one     0.821371 2017-07-02 00:00:00
      one     0.312364 2017-07-02 00:05:00
      two     0.104821 2017-07-02 00:10:00
baz   one     0.839370 2017-07-02 00:15:00
      two     0.307262 2017-07-02 00:20:00
foo   one     0.719300 2017-07-02 00:25:00
      two     0.371118 2017-07-02 00:30:00
      two     0.765483 2017-07-02 00:35:00
qux   one     0.794236 2017-07-02 00:40:00
      two     0.571231 2017-07-02 00:45:00

I would like to conditionally fill the last element of column 0 within each first-second group, following the logic described in this function

def myfunc(g):

    if len(g) >= 2:  # if the group has at least two rows, then:

        if g.loc[g.index[-1], 0] > 0.5:  # if the last element of column 0 in the group is > 0.5, then:

            # find the time difference between the last two records in column 1
            time_gap = g.loc[g.index[-1], 1] - g.loc[g.index[-2], 1]

            # and assign it to the last element of column 0 in that group
            g.loc[g.index[-1], 0] = time_gap

        else:
            # otherwise assign 'ELSE' to the last element of column 0 in the group
            g.loc[g.index[-1], 0] = 'ELSE'

    return g

Applying this function gives

a.reset_index().groupby(['first', 'second']).apply(myfunc)
Out[23]: 
  first second                0                   1
0   bar    one         0.821371 2017-07-02 00:00:00
1   bar    one             ELSE 2017-07-02 00:05:00  correct
2   bar    two         0.104821 2017-07-02 00:10:00
3   baz    one          0.83937 2017-07-02 00:15:00
4   baz    two         0.307262 2017-07-02 00:20:00
5   foo    one           0.7193 2017-07-02 00:25:00
6   foo    two         0.371118 2017-07-02 00:30:00
7   foo    two  0 days 00:05:00 2017-07-02 00:35:00  correct
8   qux    one         0.794236 2017-07-02 00:40:00
9   qux    two         0.571231 2017-07-02 00:45:00

The result above is exactly what I want. The problem is that this approach freezes my machine (16 GB RAM, i5-6200U CPU @ 2.3 GHz) when applied to my real DataFrame of roughly 150,000 rows.

What is the most efficient way to fill these elements conditionally? Do I even need to write a function?

Note: I'm running this in a Jupyter notebook on Windows 10, in case that matters.

2 Answers:

Answer 0 (score: 4)

There are several things going on here.

  1. You are editing the DataFrame from within the groupby apply. That is asking for a lot of debugging down the line.
  2. When you use apply with a groupby, you create a new DataFrame for every group. We can improve performance by manipulating the groups' indices instead.
  3. You don't need to reset the index in order to group by the index levels.

    First, make a copy of a in case something gets lost along the way; I don't want you messing up a... here it is.

    a_ = a.copy()
    

    Okay, to speed things up

    g = a.groupby(level=['first', 'second'])
    

    I'm going to use get_value and set_value with takeable=True. The takeable option lets me pass the other arguments as positional references, so I want to make sure I get my positions right.

    j0 = a.columns.get_loc(0)
    j1 = a.columns.get_loc(1)
    

    Conveniently, g has an indices attribute that tells me the positions of all the rows in each named group. I'll build a dictionary of names and indices with a comprehension, keeping only the groups that clear the first hurdle of having length 2 or more.

    g_ = {n: i for n, i in g.indices.items() if i.size > 1}
    
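    For reference (values shown are illustrative, not part of the original answer), on the dummy frame the filtered dictionary keeps only the two groups with more than one row, mapping each group name to an array of row positions:

    print(g_)
    # roughly: {('bar', 'one'): array([0, 1]), ('foo', 'two'): array([6, 7])}
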

    You are putting things of different types into column 0, and since I'm going to use set_value, I'd better cast that column to object ahead of time.

    a[0] = a[0].astype(object)
    

    Now I can loop over the groups that passed the length hurdle above.

    for n, i in g_.items():
        i0, i1 = i[-2:]
        cond = a.get_value(i1, j0, takeable=True) > 0.5
        if cond:
            tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
            a.set_value(i1, j0, tgap, takeable=True)
        else:
            a.set_value(i1, j0, 'ELSE', takeable=True)
    

    All together

    g = a.groupby(level=['first', 'second'])
    
    j0 = a.columns.get_loc(0)
    j1 = a.columns.get_loc(1)
    g_ = {n: i for n, i in g.indices.items() if i.size > 1}
    
    a[0] = a[0].astype(object)
    
    for n, i in g_.items():
        i0, i1 = i[-2:]
        cond = a.get_value(i1, j0, takeable=True) > 0.5
        if cond:
            tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
            a.set_value(i1, j0, tgap, takeable=True)
        else:
            a.set_value(i1, j0, 'ELSE', takeable=True)
    

    Timing

    %timeit a.reset_index().groupby(['first', 'second']).apply(myfunc)
    100 loops, best of 3: 7.14 ms per loop
    
    %%timeit
    a = b.copy()
    g = a.groupby(level=['first', 'second'])
    
    j0 = a.columns.get_loc(0)
    j1 = a.columns.get_loc(1)
    g_ = {n: i for n, i in g.indices.items() if i.size > 1}
    
    a[0] = a[0].astype(object)
    
    for n, i in g_.items():
        i0, i1 = i[-2:]
        cond = a.get_value(i1, j0, takeable=True) > 0.5
        if cond:
            tgap = a.get_value(i1, j1, takeable=True) - a.get_value(i0, j1, takeable=True)
            a.set_value(i1, j0, tgap, takeable=True)
        else:
            a.set_value(i1, j0, 'ELSE', takeable=True)
    
    1000 loops, best of 3: 1.01 ms per loop
    
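    Aside (not part of the original answer): get_value and set_value were later deprecated and eventually removed from pandas. Assuming a current pandas release, the same positional loop can be sketched with .iat instead:

    # hedged adaptation for newer pandas, where get_value/set_value no longer exist;
    # .iat gives the same scalar access by integer position
    for n, i in g_.items():
        i0, i1 = i[-2:]
        if a.iat[i1, j0] > 0.5:
            # time gap between the last two records of the group
            a.iat[i1, j0] = a.iat[i1, j1] - a.iat[i0, j1]
        else:
            a.iat[i1, j0] = 'ELSE'
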

Answer 1 (score: 0)

I came up with a vectorized version that performs the transformation in several steps. If you have a fairly complex transformation, I'd suggest trying to break it down into steps that you can vectorize in pandas.

The first step is to find the groups that have more than one member:

to_check = df.groupby(['first','second']).size().apply(lambda x: True if x > 1 else False)
first  second
bar    one        True
       two       False
baz    one       False
       two       False
foo    one       False
       two        True
qux    one       False
       two       False
dtype: bool
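
(Side note, not from the original answer: the lambda isn't strictly needed here; comparing the group sizes directly yields the same boolean Series.)

to_check = df.groupby(['first', 'second']).size() > 1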

Now we can merge the result back into the original df (on the index):

df = df.merge(to_check.rename('check').to_frame(), left_index=True, right_index=True)
                   one                 two  check
first second                                     
bar   one     0.821371 2017-07-02 00:00:00   True
      one     0.312364 2017-07-02 00:05:00   True
      two     0.104821 2017-07-02 00:10:00  False
baz   one     0.839370 2017-07-02 00:15:00  False
      two     0.307262 2017-07-02 00:20:00  False
foo   one     0.719300 2017-07-02 00:25:00  False
      two     0.371118 2017-07-02 00:30:00   True
      two     0.765483 2017-07-02 00:35:00   True
qux   one     0.794236 2017-07-02 00:40:00  False
      two     0.571231 2017-07-02 00:45:00  False

Now we can use the 'check' column to pick out the groups we are interested in and see whether we should do the calculation:

to_calc = df.loc[df['check'], 'one'].groupby(['first','second']).apply(lambda x: 1 if x[-1] > 0.5 else 'Else')
first  second
bar    one           Else
foo    two       0.394365

Now that we have these results, we can merge them into df:

df = df.merge(to_calc.to_frame(), left_index=True, right_index=True, how='outer')
                   one                 two  check to_calc
first second                                             
bar   one     0.821371 2017-07-02 00:00:00   True    Else
      one     0.312364 2017-07-02 00:05:00   True    Else
      two     0.104821 2017-07-02 00:10:00  False     NaN
baz   one     0.839370 2017-07-02 00:15:00  False     NaN
      two     0.307262 2017-07-02 00:20:00  False     NaN
foo   one     0.719300 2017-07-02 00:25:00  False     NaN
      two     0.371118 2017-07-02 00:30:00   True       1
      two     0.765483 2017-07-02 00:35:00   True       1
qux   one     0.794236 2017-07-02 00:40:00  False     NaN
      two     0.571231 2017-07-02 00:45:00  False     NaN

which we can then use to perform the actual calculation:

df_calc = df.loc[df['to_calc']==1,'two'].groupby(level=['first','second']).apply(lambda x: x[-1]-x[-2])
df = df.merge(df_calc.rename('calc').to_frame(), left_index=True, right_index=True, how='outer')
                   one                 two  check to_calc     calc
first second                                                      
bar   one     0.821371 2017-07-02 00:00:00   True    Else      NaT
      one     0.312364 2017-07-02 00:05:00   True    Else      NaT
      two     0.104821 2017-07-02 00:10:00  False     NaN      NaT
baz   one     0.839370 2017-07-02 00:15:00  False     NaN      NaT
      two     0.307262 2017-07-02 00:20:00  False     NaN      NaT
foo   one     0.719300 2017-07-02 00:25:00  False     NaN      NaT
      two     0.371118 2017-07-02 00:30:00   True       1 00:05:00
      two     0.765483 2017-07-02 00:35:00   True       1 00:05:00
qux   one     0.794236 2017-07-02 00:40:00  False     NaN      NaT
      two     0.571231 2017-07-02 00:45:00  False     NaN      NaT

Writing the results into only the last row of each group is somewhat tricky now, but it can be done by combining a pandas null test with a duplicate test:

calc_select = (~pd.isnull(df['calc']))&~(
       df.duplicated(subset=['first','second'],keep='last'))
else_select = (df['to_calc']=='Else')&~(
       df.duplicated(subset=['first','second'],keep='last'))
df.loc[else_select, 'one'] = df.loc[else_select, 'to_calc']
df.loc[calc_select, 'one'] = df.loc[calc_select, 'calc'].astype('O')

first second              one                 two  check to_calc     calc
0   bar    one         0.821371 2017-07-02 00:00:00   True    Else      NaT
1   bar    one             Else 2017-07-02 00:05:00   True    Else      NaT
2   bar    two         0.104821 2017-07-02 00:10:00  False     NaN      NaT
3   baz    one          0.83937 2017-07-02 00:15:00  False     NaN      NaT
4   baz    two         0.307262 2017-07-02 00:20:00  False     NaN      NaT
5   foo    one           0.7193 2017-07-02 00:25:00  False     NaN      NaT
6   foo    two         0.371118 2017-07-02 00:30:00   True       1 00:05:00
7   foo    two  0 days 00:05:00 2017-07-02 00:35:00   True       1 00:05:00
8   qux    one         0.794236 2017-07-02 00:40:00  False     NaN      NaT
9   qux    two         0.571231 2017-07-02 00:45:00  False     NaN      NaT

On the test data my solution actually runs slower than yours (18.4 ms versus your 7.08 ms), but I would expect the vectorized solution to scale better on a larger dataset. I'd be genuinely interested if you could post timings for your dataset.