Mask Pandas DataFrame rows based on the whole row

Time: 2014-05-22 06:06:22

Tags: python pandas

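Background:

I am working with 8-band multispectral satellite imagery and estimating water depth from the reflectance values. Using statsmodels, I have come up with an OLS model that predicts the depth of each pixel from that pixel's 8 reflectance values. To work with the OLS model easily, I have put all the pixel reflectance values into a pandas dataframe formatted like the one in the example below, where each row represents a pixel and each column is a spectral band of the multispectral image.

Due to some pre-processing steps, all the on-shore pixels have been converted to all zeros. I don't want to try to predict the "depth" of those pixels, so I want to restrict my OLS model predictions to the rows that are not all zeros.

I will need to reshape my results back to the row x col dimensions of the original image, so I can't just drop the all-zero rows.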
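
As a rough sketch of that constraint (not the poster's actual pipeline): predictions can be computed only for the non-zero rows and written into a full-length array pre-filled with NaN, so the result still reshapes to the image grid. The image size, the random data, and `predict_depth` (a stand-in for the fitted statsmodels OLS model) are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical image size; the real image is far larger.
n_rows, n_cols = 3, 4
cols = [f"band{i}" for i in range(1, 9)]

rng = np.random.default_rng(0)
pixels = pd.DataFrame(rng.integers(1, 10, size=(n_rows * n_cols, 8)), columns=cols)
pixels.iloc[[0, 5]] = 0  # pretend these two pixels are on shore


def predict_depth(frame):
    """Hypothetical stand-in for the fitted OLS model's prediction."""
    return frame.mean(axis=1).to_numpy()


# True for pixels with at least one non-zero band.
water = pixels.any(axis=1).to_numpy()

# Start with NaN everywhere, then fill in only the water pixels.
depth = np.full(len(pixels), np.nan)
depth[water] = predict_depth(pixels[water])

# Every pixel kept its position, so the reshape lines up with the image.
depth_image = depth.reshape(n_rows, n_cols)
print(depth_image)
```

The on-shore pixels stay in place as NaN, which is what allows the final reshape.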

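Specific question:

I have a Pandas dataframe. Some rows contain all zeros. I would like to mask those rows for some calculations, but I need to keep the rows in place. I can't figure out how to mask all the entries of the all-zero rows.

For example: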

In [1]: import pandas as pd
In [2]: import numpy as np
        # my actual data has about 16 million rows so
        # I'll simulate some data for the example. 
In [3]: cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
In [4]: rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
In [5]: zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
In [6]: df = pd.concat((rdf,zdf)).reset_index(drop=True)
In [7]: df
Out[7]: 
        band1  band2  band3  band4  band5  band6  band7  band8
    0       9      9      8      7      2      7      5      6
    1       7      7      5      6      3      0      9      8
    2       5      4      3      6      0      3      8      8
    3       6      4      5      0      5      7      4      5
    4       8      3      2      4      1      3      2      5
    5       9      7      6      3      8      7      8      4
    6       6      2      8      2      2      6      9      8
    7       9      4      0      2      7      6      4      8
    8       1      3      5      3      3      3      0      1
    9       4      2      9      7      3      5      5      0
    10      0      0      0      0      0      0      0      0
    11      0      0      0      0      0      0      0      0
    12      0      0      0      0      0      0      0      0

    [13 rows x 8 columns]

I know I can get just the rows I'm interested in by doing this:

In [8]: df[df.any(axis=1)==True]
Out[8]: 
       band1  band2  band3  band4  band5  band6  band7  band8
    0      9      9      8      7      2      7      5      6
    1      7      7      5      6      3      0      9      8
    2      5      4      3      6      0      3      8      8
    3      6      4      5      0      5      7      4      5
    4      8      3      2      4      1      3      2      5
    5      9      7      6      3      8      7      8      4
    6      6      2      8      2      2      6      9      8
    7      9      4      0      2      7      6      4      8
    8      1      3      5      3      3      3      0      1
    9      4      2      9      7      3      5      5      0

   [10 rows x 8 columns]

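But I need to reshape the data again later, so I need those rows to stay in the right place. I've tried all sorts of things, including df.where(df.any(axis=1)==True), but I can't find anything that works.

Fails:

  1. df.any(axis=1)==True gives me True for the rows I'm interested in and False for the rows I'd like to mask, but when I try df.where(df.any(axis=1)==True) I just get back the whole data frame, complete with all the zeros. I want the whole data frame, but with all the values in those zero rows masked so that, as I understand it, they show up as NaN, right?

  2. I tried getting the indexes of the all-zero rows and masking by row: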

    mskidxs = df[df.any(axis=1)==False].index
    df.mask(df.index.isin(mskidxs))
    

    That got me nowhere:

    ValueError: Array conditional must be same shape as self
    

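    The .index just gives back an Int64Index. I need a boolean array with the same dimensions as my data frame, and I can't figure out how to get one.

Thanks in advance for your help.

-Jared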
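
One way to get a boolean array with the same shape as the frame, sketched here on a tiny made-up frame: take the row-wise result of `df.any(axis=1)` and repeat it across the columns with NumPy. `df.where` accepts an array condition as long as its shape matches the frame.

```python
import numpy as np
import pandas as pd

# Made-up data: the middle row is all zeros.
df = pd.DataFrame({"band1": [3, 0, 0], "band2": [0, 0, 5]})

# One boolean per row: True if any band is non-zero.
keep_row = df.any(axis=1).to_numpy()

# Repeat that column vector across all columns to match df.shape.
keep = np.repeat(keep_row[:, np.newaxis], df.shape[1], axis=1)

# Rows where the condition is False become NaN; the shape is unchanged.
masked = df.where(keep)
print(masked)
```

Individual zeros in rows that contain other data are left alone; only rows that are zero in every band are masked.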

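2 answers:

Answer 0: (score: 2)

The process of clarifying my question led me, in a roundabout way, to the answer. A related question also helped point me in the right direction. Here is what I came up with: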

import pandas as pd
import numpy as np
# Set up my fake test data again. My actual data is described
# in the question.
cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
df = pd.concat((zdf,rdf)).reset_index(drop=True)

# View the dataframe. (sorry about the alignment, I don't
# want to spend the time putting in all the spaces)
df

    band1   band2   band3   band4   band5   band6   band7   band8
0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0
3   6   3   7   0   1   7   1   8
4   9   2   6   8   7   1   4   3
5   4   2   1   1   3   2   1   9
6   5   3   8   7   3   7   5   2
7   8   2   6   0   7   2   0   7
8   1   3   5   0   7   3   3   5
9   1   8   6   0   1   5   7   7
10  4   2   6   2   2   2   4   9
11  8   7   8   0   9   3   3   0
12  6   1   6   8   2   0   2   5

13 rows × 8 columns

# This is essentially the same as item #2 under Fails
# in my question. It gives me the indexes of the rows
# I want unmasked as True and those I want masked as
# False. However, the result is not the right shape to
# use as a mask.
df.apply( lambda row: any([i != 0 for i in row]),axis=1 )
0     False
1     False
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12     True
dtype: bool

# This is what actually works. By asking apply to broadcast the
# result (broadcast=True in old pandas, result_type='broadcast'
# in newer versions), I get a result that's the right shape to use.
land_rows = df.apply( lambda row: any([i != 0 for i in row]),axis=1,
                      result_type='broadcast' )

land_rows

Out[92]:
    band1   band2   band3   band4   band5   band6   band7   band8
0   0   0   0   0   0   0   0   0
1   0   0   0   0   0   0   0   0
2   0   0   0   0   0   0   0   0
3   1   1   1   1   1   1   1   1
4   1   1   1   1   1   1   1   1
5   1   1   1   1   1   1   1   1
6   1   1   1   1   1   1   1   1
7   1   1   1   1   1   1   1   1
8   1   1   1   1   1   1   1   1
9   1   1   1   1   1   1   1   1
10  1   1   1   1   1   1   1   1
11  1   1   1   1   1   1   1   1
12  1   1   1   1   1   1   1   1

13 rows × 8 columns

# This produces the result I was looking for. The cast makes the
# condition boolean, which newer pandas versions require:
df.where(land_rows.astype(bool))

Out[93]:
    band1   band2   band3   band4   band5   band6   band7   band8
0   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   6   3   7   0   1   7   1   8
4   9   2   6   8   7   1   4   3
5   4   2   1   1   3   2   1   9
6   5   3   8   7   3   7   5   2
7   8   2   6   0   7   2   0   7
8   1   3   5   0   7   3   3   5
9   1   8   6   0   1   5   7   7
10  4   2   6   2   2   2   4   9
11  8   7   8   0   9   3   3   0
12  6   1   6   8   2   0   2   5

13 rows × 8 columns
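
For comparison, a shorter sketch of the same idea on a small made-up frame: the row test can be done with the vectorized `df.ne(0).any(axis=1)` instead of a Python-level `apply`, and the all-zero rows can be overwritten with NaN through `.loc` on a float copy. This is an alternative formulation, not the code used above.

```python
import numpy as np
import pandas as pd

# Made-up data: the first row is all zeros.
df = pd.DataFrame({"band1": [0, 6, 9], "band2": [0, 3, 0]})

# Vectorized row test: True where at least one band is non-zero.
has_data = df.ne(0).any(axis=1)

# The same test written with a row-wise apply, for comparison.
via_apply = df.apply(lambda row: any(v != 0 for v in row), axis=1)

# Work on a float copy so NaN can be stored, then blank the zero rows.
masked = df.astype(float)
masked.loc[~has_data, :] = np.nan
print(masked)
```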

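Thanks again to those who helped. Hopefully the solution I found will be useful to somebody at some point.

I found another way to do the same thing. There are more steps involved but, according to %timeit, it is about 9 times faster. Here it is: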

def mask_all_zero_rows_numpy(df):
    """
    Take a dataframe, find all the rows that contain only zeros
    and mask them. Return a dataframe of the same shape with all
    Nan rows in place of the all zero rows.
    """
    no_data = -99
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    arr = df.to_numpy().astype(np.int16)
    # make a row full of the 'no data' value
    replacement_row = np.array([no_data for x in range(arr.shape[1])], dtype=np.int16)
    # find out what rows are all zeros
    mask_rows = ~arr.any(axis=1)
    # replace those all zero rows with all 'no_data' rows
    arr[mask_rows] = replacement_row
    # create a masked array with the no_data value masked
    marr = np.ma.masked_where(arr==no_data,arr)
    # turn masked array into a data frame
    mdf = pd.DataFrame(marr,columns=df.columns)
    return mdf

The result of mask_all_zero_rows_numpy(df) should be the same as Out[93] above.
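
A sentinel such as -99 assumes that value never occurs in the real data. A variant that avoids the sentinel, sketched here on made-up data, masks the all-zero rows of a NumPy masked array directly. `mask_all_zero_rows_ma` is a hypothetical name, and this version has not been timed against the one above.

```python
import numpy as np
import pandas as pd


def mask_all_zero_rows_ma(df):
    """Return a same-shaped frame with all-zero rows replaced by NaN."""
    # Convert to float so the result can hold NaN.
    arr = df.to_numpy(dtype=float)
    # True for rows in which every value is zero.
    zero_rows = ~arr.any(axis=1)
    # Mask whole rows instead of writing a sentinel value into them.
    marr = np.ma.masked_array(arr)
    marr[zero_rows] = np.ma.masked
    # Masked cells become NaN when the array is filled.
    return pd.DataFrame(marr.filled(np.nan), columns=df.columns, index=df.index)


# Made-up data; note that a genuine -99 reading survives untouched.
df = pd.DataFrame({"band1": [0, 4, -99], "band2": [0, 0, 2]})
result = mask_all_zero_rows_ma(df)
print(result)
```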

Answer 1: (score: 0)

It's not clear to me why you can't simply perform the calculation on only a subset of the rows:

np.average(df[1][:11])

to exclude the zero rows.

Or you can run the calculation on a slice and read the computed values back into the original dataframe:

dfs = df[:10]
dfs['1_deviation_from_mean'] = pd.Series([abs(np.average(dfs[1]) - val) for val in dfs[1]])
df['deviation_from_mean'] = dfs['1_deviation_from_mean']

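Alternatively, you could create a list of the index positions you want to mask and then do the calculation with a numpy masked array, created with np.ma.masked_where() by specifying which index positions to mask: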

row_for_mask = [row for row in df.index if all(df.loc[row] == 0)]
masked_array = np.ma.masked_where(df[1].index.isin(row_for_mask), df[1])
np.mean(masked_array)

The masked array looks like this:

Name: 1, dtype: float64(data =
0      5
1      0
2      0
3      4
4      4
5      4
6      3
7      1
8      0
9      9
10    --
11    --
12    --
Name: 1, dtype: object,
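
To illustrate the masked-array route on a whole frame rather than one column, here is a small sketch with made-up values: the all-zero rows are masked, and `mean(axis=0)` then averages each band over the remaining rows only.

```python
import numpy as np
import pandas as pd

# Made-up data: the last row is all zeros.
df = pd.DataFrame({"band1": [2.0, 4.0, 0.0], "band2": [0.0, 6.0, 0.0]})

# True for rows in which every band is zero.
zero_rows = (df == 0).all(axis=1).to_numpy()

# Mask those rows; the underlying values are left as they are.
arr = np.ma.masked_array(df.to_numpy(copy=True))
arr[zero_rows] = np.ma.masked

# Column means skip the masked rows.
band_means = arr.mean(axis=0)
print(band_means)
```

The zero in band2 of the first row still counts toward the mean, because that row has data in another band.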