大家好,我有这样的问题:
我有400.000个对象的面板数据,并且如果其中NaN含量超过40%,我想删除对象
例如:
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
... ... ... ... ... ... ...
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
... ... ... ... ...
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
我有这样的数据框,并且对于按(inn,time_reg)分组的每个子数据框,如果列中的总nans(收入1 balans1 Equity1 opprofit1 grossprofit1 netprofit1 currentassets1 stlibility1)超过40%,则需要删除它。
我有一个循环的想法,但这需要很多时间 例如:
inn time_reg revenue1 balans1 equity1 opprofit1 \
4081809 9909403758 2002 6078000.0 2270000.0 2195000.0 -32000.0
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
4081809 1324000.0 NaN 234000.0 75000.0
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
应该删除此子数据框,因为它包含40%以上的nans
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
5 0101000021 2011 3480000.0 610000.0 71000.0 NaN
6 0101000021 2012 4820000.0 710000.0 139000.0 149000.0
7 0101000021 2013 5200000.0 790000.0 148000.0 170000.0
8 0101000021 2014 5450000.0 830000.0 155000.0 180000.0
9 0101000021 2015 5620000.0 860000.0 164000.0 189000.0
10 0101000021 2016 5860000.0 885000.0 175000.0 200000.0
11 0101000021 2017 15112000.0 1275000.0 298000.0 323000.0
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
5 NaN 61000.0 NaN NaN
6 869000.0 129000.0 700000.0 571000.0
7 1040000.0 138000.0 780000.0 642000.0
8 1090000.0 145000.0 820000.0 675000.0
9 1124000.0 154000.0 850000.0 696000.0
10 1172000.0 165000.0 875000.0 710000.0
11 3023000.0 288000.0 1265000.0 977000.0
此子数据框包含的nans少于40%,并且必须位于最终数据框中
答案 0 :(得分:0)
如果您使用numpy / pandas函数进行计数,循环会不会太慢?您可以使用someDataFrame.isnull().sum().sum()
。
可能要比编写自己的循环遍历数据帧中的所有值快得多,因为这些库往往具有此类功能的非常有效的实现。
答案 1 :(得分:0)
您可以使用filter
的{{1}}方法。
这使您可以传递一个函数,该函数指示是否应过滤子帧(在这种情况下,如果相关列中的NaN含量超过40%)。要获取该信息,可以使用pd.DataFrame.groupby
来计算numpy
中的nans:
getNanFraction