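I want to build a filter for an entire dataframe, which has many columns beyond column C. I would like this filter to return the values in each column once a minimum threshold has been reached, and to stop once a maximum threshold has been reached. The minimum threshold should be 6.5 and the maximum 9.0. It is not as simple as it sounds, so bear with me...
The dataframe: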
Time A1 A2 A3
1 6.305 6.191 5.918
2 6.507 6.991 6.203
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 9.394 9.194 8.994
10 8.803 8.113 9.333
11 8.783 8.783 8.783
Desired result:
Time A1 A2 A3
1 NaN NaN NaN
2 6.507 6.991 NaN
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 NaN NaN 8.994
10 NaN NaN NaN
11 NaN NaN NaN
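For example, in column A1 there is a value of 6.407, which is below the 6.5 threshold, but since the threshold was already met at Time 2, I want to keep the data from the point the minimum threshold is first reached. For the upper threshold, the value in column A1 at Time 9 is above 9.0, so I want to omit that value and everything after it, even though the remaining values are below 9.0. I am hoping to apply this across many more columns.
Thanks!!!
Answer 0 (score: 2)
Try this: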
df
A1 A2 A3
Time
1 6.305 6.191 5.918
2 6.507 6.991 6.203
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 9.394 9.194 8.994
10 8.803 8.113 9.333
11 8.783 8.783 8.783
df2 = df > 6.5
df = df[df2.cumsum()>0]
df2 = df > 9
df = df[~(df2.cumsum()>0)]
df
A1 A2 A3
Time
1 NaN NaN NaN
2 6.507 6.991 NaN
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 NaN NaN 8.994
10 NaN NaN NaN
11 NaN NaN NaN
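To see why the cumulative sum works, here is a minimal sketch on column A1 alone. The names `started`, `stopped` and `filtered` are illustrative and not part of the answer above, and `Series.where` is used in place of boolean indexing purely for readability.

```python
import pandas as pd

# Column A1 from the question, indexed by Time 1..11.
s = pd.Series(
    [6.305, 6.507, 6.407, 6.963, 7.227, 7.445, 7.710, 8.904, 9.394, 8.803, 8.783],
    index=pd.Index(range(1, 12), name="Time"),
    name="A1",
)

# cumsum() over a boolean Series counts the True values seen so far,
# so "count > 0" stays True from the first crossing onwards.
started = (s > 6.5).cumsum() > 0
stopped = (s > 9.0).cumsum() > 0

# Keep values from the first lower crossing up to (not including) the first upper crossing.
filtered = s.where(started & ~stopped)
print(filtered)
```

The dip to 6.407 at Time 3 is kept because `started` never switches back to False, and the values at Time 10 and 11 are dropped because `stopped` never switches back either.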
Answer 1 (score: 2)
Implementation
Here's a vectorized approach using NumPy boolean indexing:
import numpy as np
import pandas as pd

# Extract values into an array
arr = df.values
# Determine the min,max limits along each column
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)
# Setup corresponding boolean mask and set those in array to be NaNs
R = np.arange(arr.shape[0])[:,None]
mask = (R < minl) | (R >= maxl)
arr[mask] = np.nan
# Finally convert to dataframe
df = pd.DataFrame(arr,columns=df.columns)
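Note that, alternatively, one could mask the input dataframe directly instead of re-creating it, but the interesting finding here is that boolean indexing into a NumPy array is faster than into a pandas dataframe. Since we are filtering the entire dataframe, we can simply re-create it.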
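For reference, masking the dataframe directly could look like the sketch below. It assumes the question's data with Time as the index, and it uses `DataFrame.mask`, which is one way to do this and not necessarily what the answer had in mind. This variant keeps the Time index and leaves the original frame untouched, at the cost of the slower pandas-side masking mentioned above.

```python
import numpy as np
import pandas as pd

# The question's data, with Time as the index.
df = pd.DataFrame(
    {
        "A1": [6.305, 6.507, 6.407, 6.963, 7.227, 7.445, 7.710, 8.904, 9.394, 8.803, 8.783],
        "A2": [6.191, 6.991, 6.901, 7.127, 7.330, 7.632, 7.837, 8.971, 9.194, 8.113, 8.783],
        "A3": [5.918, 6.203, 6.908, 7.116, 7.363, 7.575, 7.663, 8.895, 8.994, 9.333, 8.783],
    },
    index=pd.Index(range(1, 12), name="Time"),
)

# Same mask construction as in the answer.
arr = df.to_numpy()
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)

# DataFrame.mask returns a new frame with NaN wherever the mask is True.
masked = df.mask(mask)
print(masked)
```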
Closer look
Now, let's take a closer look at the mask creation part, which is the crux of the solution.
1) Input array:
In [148]: arr
Out[148]:
array([[ 6.305,  6.191,  5.918],
       [ 6.507,  6.991,  6.203],
       [ 6.407,  6.901,  6.908],
       [ 6.963,  7.127,  7.116],
       [ 7.227,  7.33 ,  7.363],
       [ 7.445,  7.632,  7.575],
       [ 7.71 ,  7.837,  7.663],
       [ 8.904,  8.971,  8.895],
       [ 9.394,  9.194,  8.994],
       [ 8.803,  8.113,  9.333],
       [ 8.783,  8.783,  8.783]])
2) Min, max limits:
In [149]: # Determine the min,max limits along each column
...: minl = (arr > 6.5).argmax(0)
...: maxl = (arr>9).argmax(0)
...:
In [150]: minl
Out[150]: array([1, 1, 2])
In [151]: maxl
Out[151]: array([8, 8, 9])
3) Use broadcasting to create a mask that spans the entire dataframe/array and selects the elements to be set to NaN:
In [152]: R = np.arange(arr.shape[0])[:,None]
In [153]: R
Out[153]:
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])
In [154]: (R < minl) | (R >= maxl)
Out[154]:
array([[ True,  True,  True],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
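One caveat with the limits in step 2: `argmax` returns 0 both when the first element is True and when a column contains no True at all. A column that never exceeds 9 would therefore get `maxl = 0` and be masked entirely. The sketch below is one possible guard, not part of the original answer; `threshold_mask` is a hypothetical helper name and the small `demo` array is made up for illustration.

```python
import numpy as np

def threshold_mask(arr, lower=6.5, upper=9.0):
    """Return a boolean mask that is True where values should become NaN."""
    above_lower = arr > lower
    above_upper = arr > upper
    n_rows = arr.shape[0]
    # Use any() to tell "first True is at row 0" apart from "no True in this column".
    # A column with no crossing gets the limit n_rows, i.e. "past the last row".
    minl = np.where(above_lower.any(axis=0), above_lower.argmax(axis=0), n_rows)
    maxl = np.where(above_upper.any(axis=0), above_upper.argmax(axis=0), n_rows)
    rows = np.arange(n_rows)[:, None]
    return (rows < minl) | (rows >= maxl)

# Column 0 never exceeds 9, column 1 never exceeds 6.5, column 2 crosses both.
demo = np.array([[6.0, 6.0, 7.0],
                 [7.0, 6.2, 9.5],
                 [8.0, 6.4, 8.0]])
mask = threshold_mask(demo)
print(mask)
```

With this guard, column 0 keeps its values from the first lower crossing to the end, and column 1 is masked entirely because it never reaches the lower threshold.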
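Runtime test
Let's time the approaches listed so far, and since it was mentioned that there would be many columns, let's use a decently large number of columns.
Approaches listed as functions: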
def cumsum_app(df): # Listed in other solution by @Merlin
    df2 = df > 6.5
    df = df[df2.cumsum()>0]
    df2 = df > 9
    df = df[~(df2.cumsum()>0)]

def boolean_indexing_app(df): # Approaches listed in this post
    arr = df.values
    minl = (arr > 6.5).argmax(0)
    maxl = (arr>9).argmax(0)
    R = np.arange(arr.shape[0])[:,None]
    mask = (R < minl) | (R >= maxl)
    arr[mask] = np.nan
    df = pd.DataFrame(arr,columns=df.columns)
Timings:
In [163]: # Create a random array with floating pt numbers between 6 and 10
...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6)
...:
...: # Create copies for testing approaches
...: df1 = df.copy()
...: df2 = df.copy()
In [164]: %timeit cumsum_app(df1)
100 loops, best of 3: 16.4 ms per loop
In [165]: %timeit boolean_indexing_app(df2)
100 loops, best of 3: 2.09 ms per loop
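Alongside the timings, it can be worth checking that both approaches produce the same output on this kind of random data. The sketch below is an assumption-laden rewrite, not the benchmarked code: `cumsum_filter` and `argmax_filter` are hypothetical variants of the two functions that return their result, and the argmax variant copies the array so the input frame is not modified.

```python
import numpy as np
import pandas as pd

def cumsum_filter(df, lower=6.5, upper=9.0):
    # Cumulative-sum approach, written with where() and returning the result.
    started = (df > lower).cumsum() > 0
    stopped = (df > upper).cumsum() > 0
    return df.where(started & ~stopped)

def argmax_filter(df, lower=6.5, upper=9.0):
    # argmax approach on a copy of the underlying array.
    arr = df.to_numpy(dtype=float, copy=True)
    minl = (arr > lower).argmax(0)
    maxl = (arr > upper).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    arr[(R < minl) | (R >= maxl)] = np.nan
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((11, 1000)) * 4 + 6)

a = cumsum_filter(df)
b = argmax_filter(df)

# A column agrees if every cell is equal or NaN in both results.
same = (a.eq(b) | (a.isna() & b.isna())).all(axis=0)
reaches_lower = (df > 6.5).any(axis=0)
never_above_upper = ~(df > 9.0).any(axis=0)
print("columns that differ:", int((~same).sum()))
print("columns that never exceed 9:", int(never_above_upper.sum()))
```

The two results differ only in columns that reach 6.5 but never exceed 9: the cumulative-sum version keeps those values, while the plain `argmax` version blanks the whole column. With 11 rows of uniform values between 6 and 10, a small share of columns falls into that case, so the two timed functions are not doing strictly identical work.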