Remove empty or NaN groups in a pandas groupby

Time: 2020-04-01 08:27:31

Tags: python pandas pandas-groupby

I have a DataFrame in which some rows contain empty (NaN) values, as in the example below:

import pandas as pd

# Rows without a total are given as None, which pandas stores as NaN.
s = pd.DataFrame(
    [[39877380, 158232151, 20], [39877380, 332086469, None],
     [39877380, 39877381, 14], [39877380, 39877383, 8],
     [73516838, 6439138, 1], [73516838, 6500551, None],
     [735571896, 203559638, None], [735571896, 282186552, None],
     [736453090, 6126187, None],
     [673117474, 12196071, None], [673117474, 12209800, None],
     [673117474, 618058747, 6]],
    columns=['start', 'end', 'total'])

When I group by the start and end columns:

s.groupby(['start', 'end']).total.sum()

the output I get is:

start      end
39877380   39877381    14.00
           39877383     8.00
           158232151   20.00
           332086469     nan
73516838   6439138      1.00
           6500551       nan
673117474  12196071      nan
           12209800      nan
           618058747    6.00
735571896  203559638     nan
           282186552     nan
736453090  6126187       nan

I want to exclude every start group whose total is NaN for all of its end values. Expected output:

start      end
39877380   39877381    14.00
           39877383     8.00
           158232151   20.00
           332086469     nan
73516838   6439138      1.00
           6500551       nan
673117474  12196071      nan
           12209800      nan
           618058747    6.00

I tried dropna(), but it removes every NaN value rather than only the groups that are entirely NaN.

I'm new to Python and pandas. Can someone help me? Thanks.

1 answer:

Answer 0 (score: 1)

In newer pandas versions, sum returns 0 for a group whose values are all missing, so pass min_count=1 to get NaN for those groups instead:

s1 = s.groupby(['start', 'end']).total.sum(min_count=1)
# Solution for older pandas versions:
#s1 = s.groupby(['start', 'end']).total.sum()
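
To make the effect of min_count concrete, here is a minimal sketch on a toy frame. The frame and the names df, default_sum and strict_sum are made up for illustration and are not part of the question's data; it assumes a pandas version in which sum accepts min_count.

```python
import pandas as pd

# Hypothetical toy data: group "a" has one real value, group "b" has none.
df = pd.DataFrame({"key": ["a", "a", "b"], "total": [1.0, None, None]})

# Default behaviour: NaN values are skipped, so an all-NaN group sums to 0.0.
default_sum = df.groupby("key")["total"].sum()

# min_count=1 requires at least one non-missing value, otherwise the result is NaN.
strict_sum = df.groupby("key")["total"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```

With the default call, the all-NaN group would be indistinguishable from a group that genuinely sums to zero, which is why the NaN-preserving form is needed before filtering.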

Then test whether each first-level (start) group has at least one non-missing value by combining Series.notna with GroupBy.transform and GroupBy.any, and filter with boolean indexing:

s2 = s1[s1.notna().groupby(level=0).transform('any')]
# Solution for older pandas versions:
#s2 = s1[s1.notnull().groupby(level=0).transform('any')]
print(s2)
start      end      
39877380   39877381     14.0
           39877383      8.0
           158232151    20.0
           332086469     NaN
73516838   6439138       1.0
           6500551       NaN
673117474  12196071      NaN
           12209800      NaN
           618058747     6.0
Name: total, dtype: float64
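
The one-liner above packs three steps together. The following sketch separates them on a small made-up frame (the data and the names has_value and keep are assumptions for illustration only), so each intermediate mask can be inspected:

```python
import pandas as pd

# Hypothetical frame with the same shape as the question's data.
df = pd.DataFrame({
    "start": [1, 1, 2, 2, 3],
    "end": [10, 11, 20, 21, 30],
    "total": [5.0, None, None, None, 7.0],
})
s1 = df.groupby(["start", "end"])["total"].sum(min_count=1)

# Step 1: True where a (start, end) pair has a sum, False where it is NaN.
has_value = s1.notna()

# Step 2: for each start, compute "any True in this group" and broadcast
# that single answer back to every row of the group.
keep = has_value.groupby(level=0).transform("any")

# Step 3: boolean indexing keeps whole groups, including their NaN rows.
s2 = s1[keep]
print(s2)
```

Here start 2 is dropped because none of its rows has a value, while the NaN row of start 1 survives because another row in the same group is non-missing.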

Alternatively, get the first-level index values with MultiIndex.get_level_values, keep the unique ones that have a non-missing value, and select them with DataFrame.loc:

idx = s1.index.get_level_values(0)
s2 = s1.loc[idx[s1.notna()].unique()]
# Solution for older pandas versions:
#s2 = s1.loc[idx[s1.notnull()].unique()]
print(s2)
start      end      
39877380   39877381     14.0
           39877383      8.0
           158232151    20.0
           332086469     NaN
73516838   6439138       1.0
           6500551       NaN
673117474  12196071      NaN
           12209800      NaN
           618058747     6.0
Name: total, dtype: float64
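
A further option, not part of the answer above, is to drop the all-NaN start groups from the original rows before aggregating, using GroupBy.filter. This is only a sketch on made-up data; the frame and the names kept_rows and result are hypothetical:

```python
import pandas as pd

# Hypothetical frame with the same shape as the question's data.
df = pd.DataFrame({
    "start": [1, 1, 2, 2, 3],
    "end": [10, 11, 20, 21, 30],
    "total": [5.0, None, None, None, 7.0],
})

# Keep only the rows of start groups that contain at least one non-missing total.
kept_rows = df.groupby("start").filter(lambda g: g["total"].notna().any())

# Aggregate afterwards; min_count=1 keeps NaN for (start, end) pairs without a value.
result = kept_rows.groupby(["start", "end"])["total"].sum(min_count=1)
print(result)
```

filter calls the lambda once per group, so on large frames with many groups the transform-based mask is likely to be faster; this form mainly trades speed for readability.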