Question

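I want to find the indices where consecutive NaNs occur in a Pandas DataFrame and, for runs of more than 3 consecutive NaNs, return the size of the run. That is: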

58234         NaN
58235         NaN
58236    0.424323
58237    0.424323
58238         NaN
58239         NaN
58240         NaN
58241         NaN
58242         NaN
58245         NaN
58246    1.483380
58247    1.483380

should return something like (58238, 6). The actual format of the result doesn't matter much. I found the following:

df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()

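But it doesn't return the right value for each index. This question may be very similar to "Identifying consecutive NaN's with pandas", but any help would be much appreciated, as I am a total pandas noob.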

Answer 1

I broke it down into steps:

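# The counter increases at every non-null value, so all NaNs in one run share a label
df['Group'] = df.a.notnull().astype(int).cumsum()
# Keep only the NaN rows (copy to avoid SettingWithCopyWarning below)
df = df[df.a.isnull()].copy()
# Keep only the groups that contain more than 3 NaNs
df = df[df.Group.isin(df.Group.value_counts()[df.Group.value_counts() > 3].index)].copy()
df['count'] = df.groupby('Group')['Group'].transform('size')
# The first row of each group carries the starting index of the run
df.drop_duplicates(['Group'], keep='first')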
Out[734]: 
        a  Group  count
ID                     
58238 NaN      2      6
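The same cumsum idea can be wrapped in a small function. This is only a sketch: the names `nan_runs`, `min_len` and `s` are made up for illustration, and the sample Series just reproduces the data from the question.

```python
import numpy as np
import pandas as pd


def nan_runs(series, min_len):
    """Return (first_index_label, run_length) for each NaN run of at least min_len."""
    is_nan = series.isna()
    # The counter increases at every non-NaN value,
    # so all NaNs belonging to one run share the same label.
    run_id = series.notna().cumsum()
    result = []
    # Look only at the NaN rows, grouped by the run they belong to.
    for _, grp in series[is_nan].groupby(run_id[is_nan]):
        if len(grp) >= min_len:
            result.append((grp.index[0], len(grp)))
    return result


# Sample data from the question (illustrative reconstruction)
s = pd.Series(
    [np.nan, np.nan, 0.424323, 0.424323] + [np.nan] * 6 + [1.483380, 1.483380],
    index=[58234, 58235, 58236, 58237, 58238, 58239,
           58240, 58241, 58242, 58245, 58246, 58247],
)

runs_over_3 = nan_runs(s, 4)   # "more than 3" means at least 4
runs_over_1 = nan_runs(s, 2)
```

With this data, `runs_over_3` should contain the single run starting at 58238 with length 6, and `runs_over_1` should also pick up the two NaNs at 58234.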

Answer 2

Assuming df has its two columns named A and B, here is a vectorized approach -

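import numpy as np

thresh = 3

a = df.A.values
b = df.B.values

# Positions where a run starts, plus one past the end of the array
idx0 = np.flatnonzero(np.r_[True, np.diff(np.isnan(b))!=0,True])
count = np.diff(idx0)   # length of every run (NaN or not)
idx = idx0[:-1]         # start position of every run
# Keep runs that are long enough and consist of NaN
valid_mask = (count>=thresh) & np.isnan(b[idx])
out_idx = idx[valid_mask]
out_num = a[out_idx]
out_count = count[valid_mask]
# In Python 3, zip is lazy, so materialize it as a list of plain ints
out = list(zip(out_num.tolist(), out_count.tolist()))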

Sample input and output -

In [285]: df
Out[285]: 
        A         B
0   58234       NaN
1   58235       NaN
2   58236  0.424323
3   58237  0.424323
4   58238       NaN
5   58239       NaN
6   58240       NaN
7   58241       NaN
8   58242       NaN
9   58245       NaN
10  58246  1.483380
11  58247  1.483380

In [286]: out
Out[286]: [(58238, 6)]

With thresh = 2, we get -

In [288]: out
Out[288]: [(58234, 2), (58238, 6)]
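To see what `idx0`, `count` and the NaN check hold at each step, here is a small sketch on a made-up array (the values in `b` are illustrative, not the question's data). The state change is written as an explicit neighbour comparison, which should be equivalent to the `np.diff(...) != 0` expression above.

```python
import numpy as np

# Made-up example: 2 NaNs, 2 numbers, 3 NaNs, 1 number
b = np.array([np.nan, np.nan, 0.4, 0.4, np.nan, np.nan, np.nan, 1.5])
mask = np.isnan(b)

# True wherever the NaN / non-NaN state changes between neighbours
change = mask[1:] != mask[:-1]
# Add a boundary at the very start and one past the end
idx0 = np.flatnonzero(np.r_[True, change, True])
count = np.diff(idx0)       # length of every run
starts = idx0[:-1]          # position where every run begins
is_nan_run = mask[starts]   # whether that run consists of NaN

print(idx0.tolist())        # [0, 2, 4, 7, 8]
print(count.tolist())       # [2, 2, 3, 1]
print(is_nan_run.tolist())  # [True, False, True, False]
```

Filtering with `(count >= thresh) & is_nan_run` then leaves only the NaN runs that are long enough.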

Answer 3

This will be a bit slow, but I'm also new to pandas and Python. It's pretty ugly, but without knowing your dataset, this is how I would do it.

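import pandas as pd

current_consec = 0
threeormore = 0

# `dataset` is your DataFrame; replace 'a' with whatever column you need
for i in dataset['a']:
    if pd.isnull(i):
        current_consec += 1
        if current_consec == 3:
            # Third NaN in a row: count it and start over
            threeormore += 1
            current_consec = 0
    else:
        current_consec = 0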
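Because it walks through the index in order, it will find every run as it occurs. The only catch is that this counts every three in a row, so a run of 6 consecutive NaNs is counted twice. If you don't want that, you have to modify the code slightly so that current_consec is not reset to 0 when the count is hit, for example by adding a pass statement.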
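One possible way to make that modification is sketched below: the counter keeps growing inside a run and the run is counted only at the moment it reaches the minimum length. The function name `count_nan_runs`, the `min_len` parameter and the sample list are made up for illustration.

```python
import numpy as np
import pandas as pd


def count_nan_runs(values, min_len=3):
    """Count runs of at least min_len consecutive NaNs, counting each run once."""
    runs = 0
    current = 0
    for v in values:
        if pd.isnull(v):
            current += 1
            # Count the run exactly once, when it first reaches min_len;
            # the counter is not reset, so longer runs are not counted again.
            if current == min_len:
                runs += 1
        else:
            current = 0
    return runs


# Same shape as the question's data: 2 NaNs, 2 numbers, 6 NaNs, 2 numbers
sample = [np.nan, np.nan, 0.42, 0.42] + [np.nan] * 6 + [1.48, 1.48]
runs_of_3 = count_nan_runs(sample, min_len=3)
runs_of_2 = count_nan_runs(sample, min_len=2)
```

Here the run of 6 should be counted once for `min_len=3`, and lowering `min_len` to 2 should also count the leading pair.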

Sorry, this is a newbie answer, but it may work. If you find something faster, please let me know, as I'd love to add it to my knowledge base.

Good luck,

Andy M

Answer 4

Unfortunately, groupby doesn't work well with NaN values, so here is a somewhat dirty way to do what you want (dirty in that I create a fake column >_>).

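By the way, the itertools.groupby function works by grouping consecutive items that have the same key-function value. enumerate gives an index and a value from nanindices (e.g., if nanindices is [0,1,4,5,6], enumerate yields [(0,0), (1,1), (2,4), (3,5), (4,6)]). The key function is the index minus the value. Note that when the value and the index both go up by one (i.e., the values are consecutive), that difference stays the same. So this groups consecutive numbers together.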
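The grouping trick described above can be checked on that same small list. This is a minimal sketch; the variable names `runs` and `pair` are only illustrative.

```python
from itertools import groupby

nanindices = [0, 1, 4, 5, 6]

runs = []
# pair[0] is the position from enumerate, pair[1] is the value;
# their difference stays constant while the values are consecutive.
for _, group in groupby(enumerate(nanindices), key=lambda pair: pair[0] - pair[1]):
    runs.append([value for _, value in group])

print(runs)  # [[0, 1], [4, 5, 6]]
```

The keys here are 0, 0, -2, -2, -2, so the first two items form one group and the last three another.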

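itemgetter(n) is just a callable object that you can apply to an item to get its nth element via its __getitem__ method. I map it over the groupby result only because you can't call len directly on the iterable, g, that groupby returns. If you don't need the actual consecutive values, you can simply convert g to a list and call len on it.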

import numpy as np
import pandas as pd
import itertools
from operator import itemgetter

locations = []
df = pd.DataFrame([np.nan]*2+[5]*3+[np.nan]*3+[4]*3+[3]*2+[np.nan]*4, columns=['A'])
# Fake column: NaN replaced by a sentinel (-1) so that groupby can see it
df['B'] = df['A'].fillna(-1)
nanindices = df.reset_index().groupby('B')['index'].apply(np.array).loc[-1]
# Index minus value is constant within a run of consecutive integers
for k, g in itertools.groupby(enumerate(nanindices), lambda ix: ix[0] - ix[1]):
    # In Python 3, map is lazy, so build a list before calling len
    consec = list(map(itemgetter(1), g))
    num_consec = len(consec)
    if num_consec >= 3:
        locations.append((int(consec[0]), num_consec))

print(locations)

For the sample DataFrame I used, the data looks like this:

     A
0   NaN
1   NaN
2   5.0
3   5.0
4   5.0
5   NaN
6   NaN
7   NaN
8   4.0
9   4.0
10  4.0
11  3.0
12  3.0
13  NaN
14  NaN
15  NaN
16  NaN

The program prints:

[(5, 3), (13, 4)]

Consecutive NaN larger than threshold in Pandas DataFrame
