Question

I'm having some trouble filtering data using pandas NAs. My DataFrame looks like this:

        Jan       Feb       Mar       Apr       May  June
0  0.349143  0.249041  0.244352       NaN  0.425336   NaN
1  0.530616  0.816829       NaN  0.212282  0.099364   NaN
2  0.713001  0.073601  0.242077  0.553908       NaN   NaN
3  0.245295  0.007016  0.444352  0.515705  0.497119   NaN
4  0.195662  0.007249       NaN  0.852287       NaN   NaN

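I need to filter out rows that have "holes". I treat each row as a time series, and a hole is an NA in the middle of the series, not at the end. In the DataFrame above, rows 0, 1 and 4 have holes, but rows 2 and 3 do not (they only have NAs at the end of the row).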

So far, the only approach I can think of is something like this:

for rowindex, row in df.iterrows():
    # now step through each entry in the row 
    # and after encountering the first NA, 
    # check if all subsequent values are NA too.

But I'm hoping there is a less complicated and more efficient way.

Thanks, An

Answer 1

As you said, looping (iterrows) is a last resort. Try this, which uses apply with axis=1 instead of iterating over the rows.

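In [19]: def holey(s):
    notnull = s.notnull().values         # work on positions, not labels
    if not notnull.any():
        return False                     # all-NA row: nothing to interrupt
    starts_at = notnull.argmax()         # position of the first non-null value
    rest = notnull[starts_at:]
    if rest.all():
        return False                     # no NA once the series has started
    next_null = (~rest).argmax()         # first NA after the start, relative to rest
    return bool(rest[next_null:].any())  # a value after that NA means a hole
   ....: 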

In [20]: df.apply(holey, axis=1)
Out[20]: 
0     True
1     True
2    False
3    False
4     True
dtype: bool

Now you can filter with df[~df.apply(holey, axis=1)].

One handy idiom here: argmax() finds the first occurrence of True in a series of boolean values.
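A small sketch of that idiom on a plain NumPy boolean array (the array names are made up for illustration). One caveat worth keeping in mind: argmax() also returns 0 when there is no True at all, so it is probably safest to pair it with any().

```python
import numpy as np

flags = np.array([False, False, True, False, True])
first_true = int(np.argmax(flags))  # position of the first True
print(first_true)  # 2

# With no True at all, argmax still returns 0, which is ambiguous.
no_flags = np.array([False, False, False])
ambiguous = int(np.argmax(no_flags))
print(ambiguous)  # 0

# Guard with any() to tell "first element is True" from "nothing is True".
safe_first = int(np.argmax(no_flags)) if no_flags.any() else None
print(safe_first)  # None
```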

Answer 2

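Here is another way using NumPy. It is faster because it applies NumPy functions to the underlying array as a whole, rather than applying a Python function to each row separately: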

import io
import pandas as pd
import numpy as np

content = '''\
        Jan       Feb       Mar       Apr       May June
   0.349143  0.249041  0.244352       NaN  0.425336  NaN
   0.530616  0.816829       NaN  0.212282  0.099364  NaN
   0.713001  0.073601  0.242077  0.553908  NaN       NaN
   0.245295  0.007016  0.444352  0.515705  0.497119  NaN
   0.195662  0.007249       NaN  0.852287  NaN       NaN'''

# content is a str, so wrap it in StringIO; the raw string avoids an invalid-escape warning
df = pd.read_table(io.StringIO(content), sep=r'\s+')

def remove_rows_with_holes(df):
    nans = np.isnan(df.values)
    h, w = nans.shape
    # print(nans)
    # [[False False False  True False  True]
    #  [False False  True False False  True]
    #  [False False False False  True  True]
    #  [False False False False False  True]
    #  [False False  True False  True  True]]

    # First index (per row) which is a NaN, or w if the row has no NaN
    # (argmax alone would return 0 for such a row)
    nan_index = np.where(nans.any(axis=1), np.argmax(nans, axis=1), w)
    # print(nan_index)
    # [3 2 4 5 2]

    # One past the last index (per row) which is not a NaN, or 0 if the
    # row is all NaN (argmin alone would return 0, giving w)
    not_nan_index = np.where(nans.all(axis=1), 0,
                             w - np.argmin(np.fliplr(nans), axis=1))
    # print(not_nan_index)
    # [5 5 4 5 4]

    # Keep rows whose first NaN comes after all of their values
    mask = nan_index >= not_nan_index
    # print(mask)
    # [False False  True  True False]

    # print(df[mask])
    #         Jan       Feb       Mar       Apr       May  June
    # 2  0.713001  0.073601  0.242077  0.553908       NaN   NaN
    # 3  0.245295  0.007016  0.444352  0.515705  0.497119   NaN
    return df[mask]

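def holey(s):
    notnull = s.notnull().values         # work on positions, not labels
    if not notnull.any():
        return False                     # all-NA row: nothing to interrupt
    starts_at = notnull.argmax()         # position of the first non-null value
    rest = notnull[starts_at:]
    if rest.all():
        return False                     # no NA once the series has started
    next_null = (~rest).argmax()         # first NA after the start, relative to rest
    return bool(rest[next_null:].any())  # a value after that NA means a hole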

def remove_using_holey(df):
    mask = df.apply(holey, axis=1)
    return df[~mask]
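
To see why comparing the two indices works in the NumPy version, the arithmetic can be traced on a small boolean array. This is only an illustrative sketch: the `nans` array below is typed in by hand to mirror rows 0 and 2 of the sample data, and the variable names are not from the answer.

```python
import numpy as np

# True marks a NaN; row 0 has a hole, row 1 only has trailing NaNs.
nans = np.array([
    [False, False, False, True, False, True],
    [False, False, False, False, True, True],
])
h, w = nans.shape

first_nan = np.argmax(nans, axis=1)
# Reversing each row turns "last non-NaN" into "first False", which argmin finds.
after_last_value = w - np.argmin(np.fliplr(nans), axis=1)

print(first_nan)         # [3 4]
print(after_last_value)  # [5 4]

# No hole exactly when the first NaN comes at or after the end of the data.
no_hole = first_nan >= after_last_value
print(no_hole)           # [False  True]
```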

Here are the timing results:

In [78]: %timeit remove_using_holey(df)
1000 loops, best of 3: 1.53 ms per loop

In [79]: %timeit remove_rows_with_holes(df)
10000 loops, best of 3: 85.6 us per loop

As the number of rows in the DataFrame grows, the difference becomes more significant:

In [85]: df = pd.concat([df]*100)

In [86]: %timeit remove_using_holey(df)
1 loops, best of 3: 1.29 s per loop

In [87]: %timeit remove_rows_with_holes(df)
1000 loops, best of 3: 440 us per loop

In [88]: 1.29 * 10**6 / 440
Out[88]: 2931.818181818182
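
The last cell converts 1.29 s to microseconds and divides by 440 us, so on this 500-row frame the NumPy version is roughly 2,900 times faster. Outside IPython, a comparison of this kind could be reproduced with the standard timeit module. The sketch below is not the code timed above: rowwise_count and vectorized_count are hypothetical stand-ins that just count NaNs per row, and the random data is made up, so the numbers it prints will differ.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: 500 rows, 6 columns, about 30% NaN.
rng = np.random.default_rng(0)
values = rng.random((500, 6))
values[rng.random((500, 6)) < 0.3] = np.nan
df = pd.DataFrame(values)

def rowwise_count(frame):
    # Hypothetical slow path: one Python-level call per row.
    return frame.apply(lambda row: row.isnull().sum(), axis=1)

def vectorized_count(frame):
    # Hypothetical fast path: one NumPy call on the whole array.
    return np.isnan(frame.values).sum(axis=1)

# Best of 3 repeats, 3 calls each, reported per call.
slow = min(timeit.repeat(lambda: rowwise_count(df), number=3, repeat=3)) / 3
fast = min(timeit.repeat(lambda: vectorized_count(df), number=3, repeat=3)) / 3
print(f"row-wise: {slow * 1e6:.0f} us, vectorized: {fast * 1e6:.0f} us")
print(f"speedup: {slow / fast:.0f}x")
```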

Answer 3

I ran into a problem similar to the OP's. I'm not sure why unutbu's solution didn't work for me, but this did:

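def remove_rows_with_holes(df):
    nans = np.isnan(df.values)
    # Keep only rows with no NaN in any column (same result as
    # np.prod(~nans, axis=1)); this also drops rows whose only NaNs are trailing
    mask = ~nans.any(axis=1)
    return df[mask]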

To ignore columns, drop them before building the mask.
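
For example, in the sample data June is entirely NaN, so it would have to be excluded first or every row would be dropped. A minimal sketch, assuming the column to ignore is known in advance (the `ignore` list and the trimmed-down frame are illustrative, not part of the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Jan": [0.349143, 0.713001, 0.245295],
        "Apr": [np.nan, 0.553908, 0.515705],
        "June": [np.nan, np.nan, np.nan],
    }
)

ignore = ["June"]  # hypothetical: columns that should not count toward the mask
nans = np.isnan(df.drop(columns=ignore).values)
mask = ~nans.any(axis=1)  # True for rows with no NaN in the remaining columns

# Index the original frame so the ignored column is still present in the result.
complete = df[mask]
print(complete.index.tolist())  # [1, 2]
```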

Thanks for your help!

Using Pandas NaNs to filter out holes in a time series
