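Using Pandas to remove values from a DataFrame after a threshold (min/max) value has been reached

Time: 2016-07-10 23:15:23

Tags: python numpy pandas

I want to build a filter for an entire DataFrame, which includes many columns beyond column C. I'd like this filter to return the values in each column once a minimum threshold has been reached, and to stop once a maximum threshold has been reached. I'd like the minimum threshold to be 6.5 and the maximum to be 9.0. It's not as simple as it sounds, so bear with me...

The DataFrame: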

Time    A1  A2  A3
1   6.305   6.191   5.918
2   6.507   6.991   6.203
3   6.407   6.901   6.908
4   6.963   7.127   7.116
5   7.227   7.330   7.363
6   7.445   7.632   7.575
7   7.710   7.837   7.663
8   8.904   8.971   8.895
9   9.394   9.194   8.994
10  8.803   8.113   9.333
11  8.783   8.783   8.783

Desired result:

Time    A1  A2  A3
1   NaN     NaN     NaN
2   6.507   6.991   NaN
3   6.407   6.901   6.908
4   6.963   7.127   7.116
5   7.227   7.330   7.363
6   7.445   7.632   7.575
7   7.710   7.837   7.663
8   8.904   8.971   8.895
9   NaN     NaN     8.994
10  NaN     NaN     NaN
11  NaN     NaN     NaN

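For example, column A1 has a value of 6.407 at Time 3, which is below the 6.5 threshold, but because the threshold was already reached at Time 2, I want to keep every value from that point on. For the upper threshold, the value in column A1 at Time 9 is above 9.0, so I want to omit that value and everything after it, even though the remaining values are below 9.0. I'm hoping to apply this across many more columns.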
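To pin the rule down, here is a minimal pure-Python sketch for a single column. The function name `trim_column` and its parameters are made up for illustration, and it assumes "reached" means strictly greater than the threshold, which matches the desired result above.

```python
import math

def trim_column(values, lower=6.5, upper=9.0):
    """Hypothetical helper: NaN before the first value > lower,
    and NaN from the first value > upper onward."""
    out = []
    started = False   # becomes True at the first value above `lower`
    stopped = False   # becomes True at the first value above `upper`
    for v in values:
        if not started and v > lower:
            started = True
        if started and v > upper:
            stopped = True
        out.append(v if (started and not stopped) else math.nan)
    return out

# Column A1 from the table above
a1 = [6.305, 6.507, 6.407, 6.963, 7.227, 7.445,
      7.710, 8.904, 9.394, 8.803, 8.783]
print(trim_column(a1))
```

The 6.407 at Time 3 survives because the column has already "started", and the 8.803 at Time 10 is dropped because the column has already "stopped".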

Thanks!!!

2 Answers:

Answer 0: (score: 2)

Try this:

df 
        A1     A2     A3
Time                     
1     6.305  6.191  5.918
2     6.507  6.991  6.203
3     6.407  6.901  6.908
4     6.963  7.127  7.116
5     7.227  7.330  7.363
6     7.445  7.632  7.575
7     7.710  7.837  7.663
8     8.904  8.971  8.895
9     9.394  9.194  8.994
10    8.803  8.113  9.333
11    8.783  8.783  8.783

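# Keep everything from the first value above 6.5 onward (per column)
df2 = df > 6.5
df  = df[df2.cumsum() > 0]
# Drop everything from the first value above 9 onward (per column)
df2 = df > 9
df  = df[~(df2.cumsum() > 0)]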

df 
         A1     A2     A3
Time                     
1       NaN    NaN    NaN
2     6.507  6.991    NaN
3     6.407  6.901  6.908
4     6.963  7.127  7.116
5     7.227  7.330  7.363
6     7.445  7.632  7.575
7     7.710  7.837  7.663
8     8.904  8.971  8.895
9       NaN    NaN  8.994
10      NaN    NaN    NaN
11      NaN    NaN    NaN
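The cumulative sum works because it turns "this row is above the threshold" into "this row or some earlier row was above the threshold". A minimal sketch of the same idea wrapped in a function follows; the name `trim_between_thresholds` and the small `demo` frame are illustrative only, and `DataFrame.where` is used instead of boolean indexing to make the NaN-filling explicit.

```python
import pandas as pd

def trim_between_thresholds(df, lower=6.5, upper=9.0):
    """Hypothetical wrapper around the cumsum idea shown above."""
    # True from the first value above `lower` onward, per column
    started = (df > lower).cumsum() > 0
    kept = df.where(started)
    # True from the first remaining value above `upper` onward
    # (NaN > upper is False, so the rows already dropped do not count)
    stopped = (kept > upper).cumsum() > 0
    return kept.where(~stopped)

# A shortened version of columns A1 and A3 from the question
demo = pd.DataFrame({
    "A1": [6.305, 6.507, 6.407, 8.904, 9.394, 8.803],
    "A3": [5.918, 6.203, 6.908, 8.895, 8.994, 9.333],
})
print(trim_between_thresholds(demo))
```

Returning a new frame rather than reassigning `df` keeps the original data available for comparison.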

Answer 1: (score: 2)

Implementation

Here is a vectorized approach using NumPy boolean indexing:

import numpy as np
import pandas as pd

# Extract values into an array
arr = df.values

# Determine the min,max limits along each column:
# the row position of the first value above each threshold
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)

# Setup corresponding boolean mask and set those in array to be NaNs
R = np.arange(arr.shape[0])[:,None]
mask = (R < minl) | (R >= maxl)
arr[mask] = np.nan

# Finally convert to dataframe (passing the index keeps the Time labels)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)

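Note that, alternatively, one could mask the input DataFrame directly instead of re-creating it, but the interesting finding here is that boolean indexing into a NumPy array is faster than boolean indexing into a pandas DataFrame. Since we are filtering the entire DataFrame, we can simply re-create it.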
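For completeness, the "mask the DataFrame directly" alternative could look roughly like the sketch below. It uses `DataFrame.mask`, which accepts a boolean array of the same shape; the shortened frame and the name `masked` are illustrative, and no claim is made here about how its speed compares.

```python
import numpy as np
import pandas as pd

# A shortened version of columns A1 and A3 from the question
df = pd.DataFrame(
    {"A1": [6.305, 6.507, 6.407, 8.904, 9.394, 8.803],
     "A3": [5.918, 6.203, 6.908, 8.895, 8.994, 9.333]},
    index=pd.RangeIndex(1, 7, name="Time"),
)

arr = df.to_numpy()
minl = (arr > 6.5).argmax(0)          # first row above 6.5, per column
maxl = (arr > 9).argmax(0)            # first row above 9, per column
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)

# Replace masked positions with NaN; index and columns are preserved
# and the original df is left untouched.
masked = df.mask(mask)
print(masked)
```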

A closer look

Now, let's take a closer look at the mask-creation part, which is the crux of the solution.

1) Input array:

In [148]: arr
Out[148]: 
array([[ 6.305,  6.191,  5.918],
       [ 6.507,  6.991,  6.203],
       [ 6.407,  6.901,  6.908],
       [ 6.963,  7.127,  7.116],
       [ 7.227,  7.33 ,  7.363],
       [ 7.445,  7.632,  7.575],
       [ 7.71 ,  7.837,  7.663],
       [ 8.904,  8.971,  8.895],
       [ 9.394,  9.194,  8.994],
       [ 8.803,  8.113,  9.333],
       [ 8.783,  8.783,  8.783]])

2) Min, max limits:

In [149]: # Determine the min,max limits along each column
     ...: minl = (arr > 6.5).argmax(0)
     ...: maxl = (arr>9).argmax(0)
     ...: 

In [150]: minl
Out[150]: array([1, 1, 2])

In [151]: maxl
Out[151]: array([8, 8, 9])

3) Use broadcasting to create a mask that spans the entire DataFrame/array and selects the elements to be set to NaN:

In [152]: R = np.arange(arr.shape[0])[:,None]

In [153]: R
Out[153]: 
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

In [154]: (R < minl) | (R >= maxl)
Out[154]: 
array([[ True,  True,  True],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
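One caveat about step 2 that the sample data does not exercise: `argmax` on an all-False column returns 0, so a column that never exceeds 9 would get `maxl = 0` and be masked entirely. A possible guard, sketched under the assumption that such a column should keep its tail, is to substitute the row count whenever a threshold is never crossed. The small `arr` below is made up to trigger the case.

```python
import numpy as np

arr = np.array([[6.0, 7.0],
                [7.0, 8.0],
                [9.5, 8.5]])   # the second column never exceeds 9

over = arr > 9
under = arr > 6.5
print(over.argmax(0))          # the all-False column reports position 0

# Hypothetical guard: use the row count n to mean "never reached"
n = arr.shape[0]
minl = np.where(under.any(0), under.argmax(0), n)
maxl = np.where(over.any(0), over.argmax(0), n)

R = np.arange(n)[:, None]
mask = (R < minl) | (R >= maxl)
print(mask)
```

With `maxl = n`, the test `R >= maxl` is never true, so nothing is cut at the top; with `minl = n`, the test `R < minl` is always true, so a column that never exceeds 6.5 is masked entirely.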

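Runtime test

Let's time the approaches listed so far, and since it was mentioned that there will be many columns, let's use a decently large number of columns.

Approaches listed as functions: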

def cumsum_app(df):    # Listed in other solution by @Merlin
    df2 = df > 6.5 
    df  = df[df2.cumsum()>0]
    df2 = df > 9   
    df  = df[~(df2.cumsum()>0)]

def boolean_indexing_app(df):  # Approaches listed in this post
    arr = df.values
    minl = (arr > 6.5).argmax(0)
    maxl = (arr>9).argmax(0)
    R = np.arange(arr.shape[0])[:,None]
    mask = (R < minl) | (R >= maxl)
    arr[mask] = np.nan
    df = pd.DataFrame(arr,columns=df.columns)

Timings:

In [163]: # Create a random array with floating pt numbers between 6 and 10
     ...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6)
     ...: 
     ...: # Create copies for testing approaches
     ...: df1 = df.copy()
     ...: df2 = df.copy()


In [164]: %timeit cumsum_app(df1)
100 loops, best of 3: 16.4 ms per loop

In [165]: %timeit boolean_indexing_app(df2)
100 loops, best of 3: 2.09 ms per loop
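Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The column count, repetition count, and seed below are arbitrary choices; the functions return their result and the NumPy version works on a copy, so the figures will not match the ones above exactly.

```python
import timeit

import numpy as np
import pandas as pd

def cumsum_app(df):
    started = (df > 6.5).cumsum() > 0
    df = df[started]
    stopped = (df > 9).cumsum() > 0
    return df[~stopped]

def boolean_indexing_app(df):
    arr = df.to_numpy(copy=True)   # copy so the caller's frame is not modified
    minl = (arr > 6.5).argmax(0)
    maxl = (arr > 9).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    arr[(R < minl) | (R >= maxl)] = np.nan
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

# Random floats between 6 and 10, as in the test above (fewer columns here)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((11, 1000)) * 4 + 6)

reps = 10
t_cumsum = timeit.timeit(lambda: cumsum_app(df), number=reps)
t_bool = timeit.timeit(lambda: boolean_indexing_app(df), number=reps)
print(f"cumsum:           {t_cumsum / reps * 1e3:.2f} ms per call")
print(f"boolean indexing: {t_bool / reps * 1e3:.2f} ms per call")
```

Absolute numbers depend on the machine and library versions, so it is worth rerunning this locally rather than relying on the 2016 figures.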