Question

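I'm trying to add a filter to groups using Pandas. In the baseball data below, I want to find the average time it takes to go from the initial 'N' to the final 'Y' in the inducted column. Basically, I want to compute the length of each group that includes a 'Y' in the inducted column and has more than one row. Any tips would be helpful!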

   playerID  yearid votedBy  ballots  needed  votes inducted category needed_note
2860  aaronha01    1982   BBWAA      415     312    406        Y   Player         NaN
3743  abbotji01    2005   BBWAA      516     387     13        N   Player         NaN
 146  adamsba01    1937   BBWAA      201     151      8        N   Player         NaN
 259  adamsba01    1938   BBWAA      262     197     11        N   Player         NaN
 384  adamsba01    1939   BBWAA      274     206     11        N   Player         NaN
 497  adamsba01    1942   BBWAA      233     175     11        N   Player         NaN
 574  adamsba01    1945   BBWAA      247     186      7        N   Player         NaN
2108  adamsbo03    1966   BBWAA      302     227      1        N   Player         NaN

Answer 1

I modified your dataset so that there are two such groups: one with 2 rows from N to Y, and another with 8 rows from N to Y. That depends on whether you count the row containing the Y. If you don't, the two groups have 1 row and 7 rows. It looks like you don't have a time-series column, so I assume the rows are evenly spaced in time.

In [25]:
import numpy as np
import pandas as pd

df = pd.read_clipboard()
print(df)
       playerID  yearid  votedBy  ballots  needed  votes  inducted  category  needed_note
3741  abbotji01    2005    BBWAA      516     387     13         N    Player          NaN
2860  aaronha01    1982    BBWAA      415     312    406         Y    Player          NaN
3743  abbotji01    2005    BBWAA      516     387     13         N    Player          NaN
 146  adamsba01    1937    BBWAA      201     151      8         N    Player          NaN
 259  adamsba01    1938    BBWAA      262     197     11         N    Player          NaN
 384  adamsba01    1939    BBWAA      274     206     11         N    Player          NaN
 497  adamsba01    1942    BBWAA      233     175     11         N    Player          NaN
 574  adamsba01    1945    BBWAA      247     186      7         N    Player          NaN
2108  adamsbo03    1966    BBWAA      302     227      1         N    Player          NaN
2861  aaronha01    1982    BBWAA      415     312    406         Y    Player          NaN

In [26]:
# Label each row with the number of 'Y' rows seen before it,
# so every run of rows ending in a 'Y' shares one label.
df['isY'] = (df.inducted == 'Y')
df['isY'] = np.hstack((0, df['isY'].cumsum().values[:-1])).T

In [27]:
print(df.groupby('isY').count())
     playerID  yearid  votedBy  ballots  needed  votes  inducted  category  needed_note  isY
isY
0           2       2        2        2       2      2         2         2            0    2
1           8       8        8        8       8      8         8         8            0    8

[2 rows x 10 columns]

Say you don't count the Y rows; then you can calculate the mean with:

df2 = df.groupby('isY').count().isY - 1
df2[df2 != 1].mean()
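The run-labelling idea in this answer can also be sketched without NumPy. The version below is only an illustration under stated assumptions: the toy `inducted` column is made up to mirror the modified dataset (rows assumed to be in chronological order), and `size()` is used because in recent pandas releases the grouping column is typically not among the columns that `count()` returns. It takes a plain mean over all runs and does not apply the `df2 != 1` filter.

```python
import pandas as pd

# Hypothetical toy column mirroring the modified dataset above:
# one run of 2 rows ending in 'Y', then one run of 8 rows ending in 'Y'.
df = pd.DataFrame({"inducted": list("NYNNNNNNNY")})

# Count the 'Y' rows seen strictly before each row. Every run of rows
# that ends with (and includes) a 'Y' then shares one label.
is_y = df["inducted"].eq("Y")
run_id = is_y.cumsum().shift(fill_value=0).rename("run_id")

# Rows per run, including the closing 'Y' row.
run_lengths = df.groupby(run_id).size()

# Rows per run, excluding the closing 'Y' row.
rows_before_y = run_lengths - 1

print(run_lengths.tolist())
print(rows_before_y.mean())
```

Any rows after the last 'Y' would form a trailing run that never reaches a 'Y', so real data may need that last run dropped before averaging.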

Answer 2

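I simulated my own data so I could test your question easily. I created a subset called df_inducted that holds the rows of players who were eventually inducted; then, using the isin() function, we can make sure only those players are considered in the analysis. Then I find the minimum and maximum of their dates and average the differences.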

import pandas as pd

df = pd.DataFrame({'player':['Nate','Will','Nate','Will'], 
                   'inducted': ['Y','Y','N','N'],
                   'date':[2014,2000,2011,1999]})

df_inducted = df[df.inducted=='Y']
df_subset = df[df.player.isin(df_inducted.player)]

maxs = df_subset.groupby('player')['date'].max()
mins = df_subset.groupby('player')['date'].min()

maxs = pd.DataFrame(maxs)
maxs.columns = ['max_date']
mins = pd.DataFrame(mins)
mins.columns = ['min_date']

min_and_max = maxs.join(mins)
final = min_and_max['max_date'] - min_and_max['min_date']

print("average time:", final.mean())
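The same min/max steps can be written more compactly with a single `agg` call. This is just a sketch of an alternative layout on the answer's simulated data, not a change to the approach; the names `inducted_players`, `span` and `years` are introduced here for illustration.

```python
import pandas as pd

# Same simulated data as in the answer above.
df = pd.DataFrame({'player': ['Nate', 'Will', 'Nate', 'Will'],
                   'inducted': ['Y', 'Y', 'N', 'N'],
                   'date': [2014, 2000, 2011, 1999]})

# Players with at least one 'Y' row.
inducted_players = df.loc[df['inducted'] == 'Y', 'player']

# Keep every row belonging to those players.
subset = df[df['player'].isin(inducted_players)]

# First and last date per player, computed in one pass.
span = subset.groupby('player')['date'].agg(['min', 'max'])
years = span['max'] - span['min']

print("average time:", years.mean())
```

Note that this measures first date to last date per player, which equals the time to induction only if the 'Y' row is the player's latest row.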

Answer 3

The filter method of the DataFrameGroupBy class operates on each subframe in the group. See help(pd.core.groupby.DataFrameGroupBy.filter). The data is:

print(df)
  inducted playerID
0        Y        a
1        N        a
2        N        a
3        Y        b
4        N        b
5        N        c
6        N        c
7        N        c

Example code:

import pandas as pd

g = df.groupby('playerID')
madeit = g.filter(
        lambda subframe:
                'Y' in set(subframe.inducted)).groupby('playerID')

# The filter removed player 'c' who never has inducted == 'Y'
print(madeit.head())
           inducted playerID
playerID                    
a        0        Y        a
         1        N        a
         2        N        a
b        3        Y        b
         4        N        b

# The 'aggregate' function applies a function to each subframe
print(madeit.aggregate(len))
          inducted
playerID          
a                3
b                2
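To connect the filter back to the original question (average time from the first 'N' to the 'Y'), here is a minimal sketch. It assumes a `yearid` column like the one in the question's data; the rows below are made up for illustration and do not come from the real dataset.

```python
import pandas as pd

# Hypothetical ballots: players 'a' and 'b' are eventually inducted, 'c' is not.
df = pd.DataFrame({
    'playerID': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'yearid':   [1980, 1981, 1982, 1990, 1991, 2000, 2001, 2002],
    'inducted': ['N', 'N', 'Y', 'N', 'Y', 'N', 'N', 'N'],
})

# Keep only players that have a 'Y' row and more than one row in total.
madeit = df.groupby('playerID').filter(
    lambda sub: bool((sub['inducted'] == 'Y').any()) and len(sub) > 1)

# Year of the first ballot and year of induction for each remaining player.
first_ballot = madeit.groupby('playerID')['yearid'].min()
induction = (madeit[madeit['inducted'] == 'Y']
             .groupby('playerID')['yearid'].max())

years_to_induction = induction - first_ballot
print(years_to_induction.mean())
```

Using years rather than row counts avoids assuming that ballots are evenly spaced in time.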
