Question

Consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B' : [12,10,-2,-4,-2,5,8,7],
                   'C' : [-5,5,-20,0,1,5,4,-4]})

df
Out[12]: 
     A   B   C
0  foo  12  -5
1  bar  10   5
2  foo  -2 -20
3  bar  -4   0
4  foo  -2   1
5  bar   5   5
6  foo   8   4
7  foo   7  -4

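Here I need to compute, for each group in A, the sum of the elements in B conditional on C being non-negative (i.e. C >= 0, a condition based on another column), and vice versa for C.

However, my code fails:

df.groupby('A').agg({'B': lambda x: x[x.C>0].sum(),
                     'C': lambda x: x[x.B>0].sum()})

AttributeError: 'Series' object has no attribute 'B'

So it seems apply would be preferred (because apply sees the whole dataframe, I think), but unfortunately I cannot use a dictionary with apply. So I am stuck. Any ideas?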
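To see why the lambda fails, it can help to check what each method actually hands to the function. The sketch below is only an illustration (the helpers `record_agg` and `record_apply` are made up for this purpose): with a dictionary, agg passes each function a single column of one group, while apply passes the group's sub-frame.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})

seen_by_agg = []
seen_by_apply = []

def record_agg(x):
    # Hypothetical probe: remember the type agg passes in, then aggregate.
    seen_by_agg.append(type(x).__name__)
    return x.sum()

def record_apply(g):
    # Hypothetical probe: remember the type apply passes in, then aggregate.
    seen_by_apply.append(type(g).__name__)
    return g['B'].sum()

df.groupby('A').agg({'B': record_agg})
# Selecting B and C explicitly keeps the grouping column out of the groups.
df.groupby('A')[['B', 'C']].apply(record_apply)

print(set(seen_by_agg))
print(set(seen_by_apply))
```

Inside the agg lambda, x is therefore a Series of B values only, so x.C cannot exist.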

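One not-so-pretty, not-so-efficient solution would be to create these conditional variables before running the groupby, but I am sure this solution does not use the potential of Pandas.

So, for instance, the expected output for the group bar and column B would be

+10 (indeed C equals 5 and is >=0) -4 (indeed C equals 0 and is >=0) +5 = 11

Another example: group foo and column B

NaN (indeed C equals -5 so I don't want to consider the 12 value in B) + NaN (indeed C = -20) -2 (indeed C = 1 so it's positive) + 8 + NaN = 6

Note that I use NaN instead of zero, because if we put zeros, functions other than sum (the median, for instance) would give wrong results.

In other words, this is a simple conditional sum where the condition is based on another column. Thanks!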

Answer 1

Another alternative is to precompute the values you will need before using groupby/agg:

import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame({'A' : np.random.choice(['foo', 'bar'], replace=True, size=(N,)),
                   'B' : np.random.randint(-10, 10, size=(N,)),
                   'C' : np.random.randint(-10, 10, size=(N,))})

def using_precomputation(df):
    # Zero out each value whose partner column is negative, then sum per group.
    # Note: this adds the helper columns B2 and C2 to df in place.
    df['B2'] = df['B'] * (df['C'] >= 0).astype(int)
    df['C2'] = df['C'] * (df['B'] >= 0).astype(int)
    result = df.groupby('A').agg({'B2': 'sum', 'C2': 'sum'})
    return result.rename(columns={'B2':'B', 'C2':'C'})

Let's compare using_precomputation with using_index and using_apply:

def using_index(df):
    # Each lambda sums the *other* column, so the labels are swapped afterwards.
    result = df.groupby('A').agg({'B': lambda x: df.loc[x.index, 'C'][x >= 0].sum(),
                                  'C': lambda x: df.loc[x.index, 'B'][x >= 0].sum()})
    return result.rename(columns={'B':'C', 'C':'B'})

def my_func(row):
    b = row[row.C >= 0].B.sum()
    c = row[row.B >= 0].C.sum()
    return pd.Series({'B':b, 'C':c})

def using_apply(df):
    return df.groupby('A').apply(my_func)

First, let's check that they all return the same result:

def is_equal(df, func1, func2):
    result1 = func1(df).sort_index(axis=1)
    result2 = func2(df).sort_index(axis=1)
    assert result1.equals(result2)
is_equal(df, using_precomputation, using_index)
is_equal(df, using_precomputation, using_apply)

Using the 1000-row DataFrame above:

In [83]: %timeit using_precomputation(df)
100 loops, best of 3: 2.45 ms per loop

In [84]: %timeit using_index(df)
100 loops, best of 3: 4.2 ms per loop

In [85]: %timeit using_apply(df)
100 loops, best of 3: 6.84 ms per loop

Why is using_precomputation faster?

Precomputation allows us to take advantage of fast vectorized arithmetic on entire columns and allows the aggregation function to be the simple builtin sum. Builtin aggregators tend to be faster than custom aggregation functions such as the lambdas in using_index above (based on jezrael's solution).

Moreover, the less work you have to do on each group, the better off you are performance-wise. Having to do double-indexing for each group hurts performance.

Another real killer of performance is using groupby/apply(func) where func returns a Series. This forms one Series for each row of the result, and then causes Pandas to align and concatenate all the Series. Since the Series typically tend to be short and the number of Series tends to be big, concatenating all these little Series tends to be slow. Again, you tend to get the best performance out of Pandas/NumPy when performing vectorized operations on big arrays. Looping through lots of tiny results kills performance.
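The precomputation above replaces the excluded values with zeros, which is fine for a sum but not for statistics such as the median that the question mentions. A minimal sketch of a variant, assuming Series.where is acceptable here: it masks the excluded values with NaN before grouping, and the builtin aggregators skip NaN by default. The names `using_where` and `masked` are illustrative and not part of the timings above.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})

def using_where(df, how='sum'):
    # Keep B where C >= 0 and C where B >= 0; everything else becomes NaN.
    # A new frame is built, so the caller's df is left untouched.
    masked = pd.DataFrame({'A': df['A'],
                           'B': df['B'].where(df['C'] >= 0),
                           'C': df['C'].where(df['B'] >= 0)})
    # `how` is the name of a builtin aggregator such as 'sum' or 'median'.
    return masked.groupby('A').agg(how)

sums = using_where(df, 'sum')
medians = using_where(df, 'median')
print(sums)
print(medians)
```

The masked columns become floats because of the NaN values. A group with no qualifying rows would still sum to 0 by default; if NaN is preferred in that case, calling the groupby's sum with min_count=1 is one option.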

Answer 2

I think you can use:

print(df.groupby('A').agg({'B': lambda x: df.loc[x.index, 'C'][x >= 0].sum(),
                           'C': lambda x: df.loc[x.index, 'B'][x >= 0].sum()}))
      C   B
A          
bar  11  10
foo   6  -5

For easier understanding, here are custom functions that do the same as above:

def f(x):
    s = df.loc[x.index, 'C']
    return s[x>=0].sum()
def f1(x):
    s = df.loc[x.index, 'B']
    return s[x>=0].sum()


print(df.groupby('A').agg({'B': f, 'C': f1}))
      C   B
A          
bar  11  10
foo   6  -5
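Note that in the snippets above, the column labelled B holds the sum of C over the rows where B >= 0 (and vice versa), which is why the labels look swapped relative to the question. A possible variant, sketched under the assumption that the question's labelling is wanted, filters each column by the condition on the other column instead. The function names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})

def sum_b_where_c_nonneg(x):
    # x holds the B values of one group; look up the matching C values
    # in the full frame by index label and keep B only where C >= 0.
    return x[df.loc[x.index, 'C'] >= 0].sum()

def sum_c_where_b_nonneg(x):
    # Same idea with the roles of B and C exchanged.
    return x[df.loc[x.index, 'B'] >= 0].sum()

result = df.groupby('A').agg({'B': sum_b_where_c_nonneg,
                              'C': sum_c_where_b_nonneg})
print(result)
```

With this ordering no rename is needed afterwards, at the cost of the same per-group index lookup as before.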

EDIT:

root's solution is very nice, but it can be made even better:

def my_func(row):
    b = row[row.C >= 0].B.sum()
    c = row[row.B >= 0].C.sum()
    return pd.Series({'C':b, 'B':c})

result = df.groupby('A').apply(my_func)
      C   B
A          
bar  11  10
foo   6  -5

Answer 3

You can use apply to return a tuple containing the desired fields, and then unpack it using zip.

def my_func(row):
    b = row[row.C >= 0].B.sum()
    c = row[row.B >= 0].C.sum()
    return b, c

# Perform the groupby aggregation.
result = df.groupby('A').apply(my_func).to_frame()

# Unpack the resulting tuple and drop the extra column.
result['B'], result['C'] = zip(*result[0])
result.drop(0, axis=1, inplace=True)

This produces the following output:

      B   C
A          
bar  11  10
foo   6  -5
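As an alternative to the zip step, the tuples can be expanded in one go by passing them to the DataFrame constructor. This is only a sketch: the variable names `pairs` and `result` are illustrative, and the B and C columns are selected explicitly so that the function never sees the grouping column.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': [12, 10, -2, -4, -2, 5, 8, 7],
                   'C': [-5, 5, -20, 0, 1, 5, 4, -4]})

def my_func(group):
    # Sum B over rows with C >= 0, and C over rows with B >= 0.
    b = group.loc[group['C'] >= 0, 'B'].sum()
    c = group.loc[group['B'] >= 0, 'C'].sum()
    return b, c

# One (b, c) tuple per group, indexed by the group key.
pairs = df.groupby('A')[['B', 'C']].apply(my_func)

# Expand the tuples into two named columns, keeping the group index.
result = pd.DataFrame(pairs.tolist(), index=pairs.index, columns=['B', 'C'])
print(result)
```

This avoids the temporary column 0 and the in-place drop.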

How to get multiple conditional operations after a Pandas groupby?

3 answers: