How do I filter out months with fewer than 15 entries from a pandas df?

Time: 2018-11-24 21:38:25

Tags: python pandas filter timestamp conditional

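I have a multi-index data frame organized by year, month, and day, covering 1960 to 2017, and I want to be able to check whether a month contains more than 15 NaNs.

Can someone help me find an efficient way to do this?

Thanks in advance. The data frame: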

                           A    B   C   D   E   F   G   H
Year    Month   Day                             
1960    6        1  0.053142    0.632151    NaN -0.740130   NaN -1.273792   NaN -0.287078
                 2  0.827514    -0.487477   NaN -0.246897   NaN -0.310194   NaN 2.150300
                 3  -1.403216   0.350322    NaN 2.134335    NaN 0.023102    NaN 0.343759
                 4  0.305884    0.663174    NaN -2.073908   NaN 0.400311    NaN 0.149292
                 5  0.720521    -2.081981   NaN 0.672169    NaN -0.172794   NaN -0.549559
                 6  -0.987216   -1.190550   NaN 0.318706    NaN 0.863885    NaN -0.995961
                 7  1.781080    0.636422    NaN -0.382552   NaN -0.109566   NaN 0.410586
                 8  -0.654413   -0.094920   NaN -1.763118   NaN 0.075046    NaN -1.130280
                 9  -0.634353   -1.514066   NaN -0.003556   NaN -1.560351   NaN 1.001637
                 10 -1.742696   1.173806    NaN 0.909725    NaN -1.428291   NaN -1.369954

1 answer:

Answer 0 (score: 0)

An example df:

import numpy as np
import pandas as pd

# create a test dataframe similar to yours
df = pd.DataFrame(np.random.randn(10,8), columns=list('ABCDEFGH'))
df[['C', 'E', 'G']] = np.nan
df['Year'] = 1960
df['Month'] = 6
df['Day'] = range(1,11)

df2 = pd.DataFrame(np.random.randn(10,8), columns=list('ABCDEFGH'))
df2[['B']] = np.nan
df2['Year'] = 1960
df2['Month'] = 7
df2['Day'] = range(1,11)
new_df = pd.concat([df,df2])
new_df.set_index(['Year', 'Month', 'Day'], inplace=True)

Then you can do the following:

# flag every NaN, stack the flags into one long Series, then sum them per group
# this groups on Year and Month (levels 0 and 1); change the levels to group differently
stackdf = pd.isna(new_df).stack().groupby(level=[0,1]).transform('sum')

# filter original df where the index is in the stacked df index
# where the stackdf sum is greater than 15
new_df[new_df.index.isin(stackdf[stackdf>15].unstack().index)]

                       A    B   C   D   E   F   G   H
Year    Month   Day                             
1960    6        1  0.053142    0.632151    NaN -0.740130   NaN -1.273792   NaN -0.287078
                 2  0.827514    -0.487477   NaN -0.246897   NaN -0.310194   NaN 2.150300
                 3  -1.403216   0.350322    NaN 2.134335    NaN 0.023102    NaN 0.343759
                 4  0.305884    0.663174    NaN -2.073908   NaN 0.400311    NaN 0.149292
                 5  0.720521    -2.081981   NaN 0.672169    NaN -0.172794   NaN -0.549559
                 6  -0.987216   -1.190550   NaN 0.318706    NaN 0.863885    NaN -0.995961
                 7  1.781080    0.636422    NaN -0.382552   NaN -0.109566   NaN 0.410586
                 8  -0.654413   -0.094920   NaN -1.763118   NaN 0.075046    NaN -1.130280
                 9  -0.634353   -1.514066   NaN -0.003556   NaN -1.560351   NaN 1.001637
                 10 -1.742696   1.173806    NaN 0.909725    NaN -1.428291   NaN -1.369954
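To look at the counts themselves rather than the filtered rows, a per-month summary can help. The following is a minimal sketch, not part of the original answer: `demo`, `nan_per_month`, `threshold` and `bad_months` are illustrative names, and the frame is a deterministic stand-in for the random test data (June has three all-NaN columns, July has one).

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the test frame: 10 days each for June and July 1960.
idx = pd.MultiIndex.from_product(
    [[1960], [6, 7], range(1, 11)], names=['Year', 'Month', 'Day']
)
demo = pd.DataFrame(1.0, index=idx, columns=list('ABCDEFGH'))
month = demo.index.get_level_values('Month')
demo.loc[month == 6, ['C', 'E', 'G']] = np.nan
demo.loc[month == 7, ['B']] = np.nan

# NaNs per row, then summed per (Year, Month): one number per month.
nan_per_month = demo.isna().sum(axis=1).groupby(level=['Year', 'Month']).sum()
print(nan_per_month)

# Months whose NaN count exceeds the threshold (15, as in the question).
threshold = 15
bad_months = nan_per_month[nan_per_month > threshold].index

# Keep only the rows that belong to those months.
bad_rows = demo[demo.index.droplevel('Day').isin(bad_months)]
```

Here June has 30 NaNs (10 days × 3 columns) and July has 10, so only the June rows are kept.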

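You can also see the months with fewer than 15 NaNs by running new_df[new_df.index.isin(stackdf[stackdf<15].unstack().index)]. Note that a month with exactly 15 NaNs matches neither filter; use >= or <= if you need to include it.
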
                       A    B   C   D   E   F   G   H
Year    Month   Day                             
1960     7       1  0.994542    NaN 0.488464    0.809915    0.144305    -1.092597   0.555626    0.012135
                 2  -0.682796   NaN -0.781031   -0.847972   0.238397    0.364584    -0.271764   0.930113
                 3  0.254320    NaN -0.474764   0.154370    -1.497867   -1.454383   0.191503    0.494441
                 4  0.994579    NaN 0.362073    -0.537878   -0.512388   -0.501573   0.315398    1.377701
                 5  0.623287    NaN 1.286725    -0.770290   -0.614005   0.552683    0.225974    -0.564017
                 6  -0.252969   NaN -1.127418   -0.357725   -1.069318   0.218666    1.296458    -0.319678
                 7  0.202788    NaN 0.385931    -0.169915   0.167754    0.821923    0.181937    -0.198668
                 8  -0.272891   NaN 0.963414    0.887208    -1.903742   -2.026687   0.897575    1.148448
                 9  1.398781    NaN -0.298804   -1.081953   -1.346193   0.926548    0.147855    -1.632059
                 10 0.489751    NaN 0.433767    0.752071    -0.714030   -1.776365   0.247908    0.919387

Because I use stack, this counts all the NaN values in a group across every column, not the NaNs in one specific column.
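If the count should apply to a single column instead, the stack step can be dropped. This is a sketch under assumptions rather than part of the original answer: the column `'C'`, the threshold of 5, and the names `demo`, `c_nans` and `only_c` are chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the test frame: 10 days each for June and July 1960.
idx = pd.MultiIndex.from_product(
    [[1960], [6, 7], range(1, 11)], names=['Year', 'Month', 'Day']
)
demo = pd.DataFrame(1.0, index=idx, columns=list('ABCDEFGH'))
month = demo.index.get_level_values('Month')
demo.loc[month == 6, ['C', 'E', 'G']] = np.nan
demo.loc[month == 7, ['B']] = np.nan

# Count NaNs in column C only, per (Year, Month), broadcast back to every row.
c_nans = demo['C'].isna().groupby(level=['Year', 'Month']).transform('sum')

# Keep the rows of months where column C has more than 5 NaNs
# (5 is an illustrative threshold, not a value from the question).
only_c = demo[c_nans > 5]
print(only_c.index.get_level_values('Month').unique())
```

Because `transform` returns a Series aligned with the original index, it can be used directly as a boolean mask without the `unstack` and `isin` steps.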