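I want to drop every row of a pandas DataFrame whose values in two columns are close, within a specific predefined range, to those of an earlier row.
For example: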
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [20.02, 19.96, 19.98, 20.10, 26.75, 56.12],
                   'c': [10.12, 10.10, 123.54, 124.12, 245.12, 895.21]})
a b c
1 20.02 10.12
2 19.96 10.10
3 19.98 123.54
4 20.10 124.12
5 26.75 245.12
6 56.12 895.21
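Filter the rows based on columns b and c: if the current values of b and c are both close (within 1%) to the values of the previously accepted row:
(0.99*previous_b < b < 1.01*previous_b) and (0.99*previous_c < c < 1.01*previous_c)
then exclude the row.
Result: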
a b c
1 20.02 10.12
3 19.98 123.54
5 26.75 245.12
6 56.12 895.21
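The 1% condition above can be sketched as a small Python predicate. The names `within_pct` and `is_near_duplicate` are hypothetical, chosen only for illustration, and the sketch assumes the reference values are positive (for negative values the bounds would swap).

```python
def within_pct(value, reference, pct=0.01):
    """Return True if value lies strictly within +/- pct of reference.

    Assumes reference > 0; for negative references the bounds would swap.
    """
    return (1 - pct) * reference < value < (1 + pct) * reference


def is_near_duplicate(b, c, previous_b, previous_c, pct=0.01):
    """Return True only if both b and c are within pct of the previous accepted row."""
    return within_pct(b, previous_b, pct) and within_pct(c, previous_c, pct)


# Row 2 of the example compared with row 1: both columns are close.
print(is_near_duplicate(19.96, 10.10, 20.02, 10.12))
# Row 3 compared with row 1: b is close, c is not.
print(is_near_duplicate(19.98, 123.54, 20.02, 10.12))
```

A row is excluded only when both columns match, which is why row 3 is kept even though its b value is almost unchanged.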
I can use numpy.isclose against a single number:
df['b'].apply(np.isclose, b=20.02, atol=0.01 * 20.02)
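How can I generalize this so that the condition is applied iteratively across the whole DataFrame, and to two different columns at once?
Side note: my pandas DataFrame has 2 million rows, so I would like to know the most efficient way to do this.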
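For reference, the single-number check extends to two columns when the reference row is fixed. The sketch below assumes the first row (20.02, 10.12) as the reference and uses `rtol` instead of a hand-computed `atol`; the variable names are illustrative. Note that numpy.isclose uses an inclusive bound, whereas the condition stated earlier is strict, and that this does not yet handle the reference changing as rows are accepted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [20.02, 19.96, 19.98, 20.10, 26.75, 56.12],
                   'c': [10.12, 10.10, 123.54, 124.12, 245.12, 895.21]})

# Assumed fixed reference: the values of the first row.
ref_b, ref_c = 20.02, 10.12

# np.isclose(a, ref, rtol, atol) tests |a - ref| <= atol + rtol * |ref|,
# so rtol=0.01 with atol=0 is a 1% tolerance relative to the reference.
close_b = np.isclose(df['b'].to_numpy(), ref_b, rtol=0.01, atol=0.0)
close_c = np.isclose(df['c'].to_numpy(), ref_c, rtol=0.01, atol=0.0)

# A row matches the reference only if both columns are close.
close_to_ref = close_b & close_c
print(close_to_ref)
```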
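Answer 0: (score: 2)
Since the row to compare against can change depending on the result of each comparison, I'm not sure this can be done without logic equivalent to a for loop: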
# Take the initial comparison values from the first row
b, c = df.iloc[0][['b', 'c']]
# Always include the first row in the result
filters = [True]
# Skip the first row in the comparisons
for index, row in df.iloc[1:].iterrows():
    if 0.99*b <= row['b'] <= 1.01*b and 0.99*c <= row['c'] <= 1.01*c:
        filters.append(False)
    else:
        filters.append(True)
        # Update the comparison values to the latest accepted row
        b = row['b']
        c = row['c']
df2 = df.loc[filters]
print(df2)
a b c
0 1 20.02 10.12
2 3 19.98 123.54
4 5 26.75 245.12
5 6 56.12 895.21
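Checking the edge case where row(n+1) is close to row(n) (and is therefore excluded), but row(n+2) is close to row(n+1) and not close to row(n) (and should therefore be included):
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [20, 20, 20],
                   'c': [100, 100.9, 101.1]})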
a b c
0 1 20 100.0
2 3 20 101.1
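Given the 2 million rows mentioned in the question, one possible variant of the loop above reads the two columns into plain Python lists first, which avoids building a Series for every row as iterrows does. This is a sketch under the same assumptions (positive values, comparison with the last accepted row only); the function name `drop_close_to_last_kept` and its parameters are hypothetical, and the speed-up has not been measured here.

```python
import pandas as pd


def drop_close_to_last_kept(df, col_b="b", col_c="c", pct=0.01):
    """Keep a row unless both columns are within pct of the last kept row."""
    b_vals = df[col_b].tolist()
    c_vals = df[col_c].tolist()
    keep = [False] * len(df)
    if not keep:
        return df.copy()

    # The first row is always kept and becomes the first reference.
    keep[0] = True
    prev_b, prev_c = b_vals[0], c_vals[0]
    lo, hi = 1 - pct, 1 + pct

    for i in range(1, len(b_vals)):
        b, c = b_vals[i], c_vals[i]
        if lo * prev_b <= b <= hi * prev_b and lo * prev_c <= c <= hi * prev_c:
            continue  # close to the last kept row: exclude it
        keep[i] = True
        # Only an accepted row becomes the new reference.
        prev_b, prev_c = b, c

    return df.loc[keep]


example = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                        'b': [20.02, 19.96, 19.98, 20.10, 26.75, 56.12],
                        'c': [10.12, 10.10, 123.54, 124.12, 245.12, 895.21]})
edge_case = pd.DataFrame({'a': [1, 2, 3],
                          'b': [20, 20, 20],
                          'c': [100, 100.9, 101.1]})

print(drop_close_to_last_kept(example))
print(drop_close_to_last_kept(edge_case))
```

The loop is still sequential, because each decision depends on which earlier rows were kept; only the per-row overhead changes.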
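Answer 1: (score: 0)
This is largely based on ukemi's earlier answer. In this example, the values of each row are compared against all previously accepted rows, not just the last accepted one.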
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'b': [20.02, 19.96, 19.98, 20.10, 26.75, 56.12, 20.04, 56.24, 56.15],
                   'c': [10.12, 10.10, 123.54, 124.12, 245.12, 6.00, 10.11, 6.50, 128.67]})
a b c
0 1 20.02 10.12
1 2 19.96 10.10
2 3 19.98 123.54
3 4 20.10 124.12
4 5 26.75 245.12
5 6 56.12 6.00
6 7 20.04 10.11
7 8 56.24 6.50
8 9 56.15 128.67
b = []
c = []
# Take the initial comparison values from the first row
b.append(df.iloc[0]['b'])
c.append(df.iloc[0]['c'])
# Always include the first row in the result
filters = [True]
# Skip the first row in the comparisons
for index, row in df.iloc[1:].iterrows():
    tag = 0
    for i in range(len(b)):
        # Thresholds have been changed to 5% (b) and 10% (c) in this case.
        if 0.95*b[i] <= row['b'] <= 1.05*b[i] and 0.90*c[i] <= row['c'] <= 1.10*c[i]:
            filters.append(False)
            tag = 1
            break
    if tag == 0:
        filters.append(True)
        # Add the newly accepted row to the comparison values
        b.append(row['b'])
        c.append(row['c'])
df2 = df.loc[filters]
print(df2)
a b c
0 1 20.02 10.12
2 3 19.98 123.54
4 5 26.75 245.12
5 6 56.12 6.00
8 9 56.15 128.67
Please let me know if there is a faster way to achieve the same result.
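One possible direction, offered as a sketch rather than a measured improvement: keep the outer loop, but replace the inner Python loop with a single NumPy comparison against all accepted rows at once. The function name `drop_close_to_any_kept` and its parameters are hypothetical, the 5% and 10% thresholds are taken from the code above, and positive values are assumed. The worst case still grows with the number of rows times the number of accepted rows.

```python
import numpy as np
import pandas as pd


def drop_close_to_any_kept(df, col_b="b", col_c="c", tol_b=0.05, tol_c=0.10):
    """Keep a row unless it is close, in both columns, to any earlier kept row."""
    b_vals = df[col_b].to_numpy(dtype=float)
    c_vals = df[col_c].to_numpy(dtype=float)
    n = len(df)
    keep = np.zeros(n, dtype=bool)

    # Preallocated storage for the accepted rows; only the first n_kept are valid.
    kept_b = np.empty(n)
    kept_c = np.empty(n)
    n_kept = 0

    for i in range(n):
        kb = kept_b[:n_kept]
        kc = kept_c[:n_kept]
        # Vectorized version of the inner loop: compare with every kept row.
        close = (((1 - tol_b) * kb <= b_vals[i]) & (b_vals[i] <= (1 + tol_b) * kb)
                 & ((1 - tol_c) * kc <= c_vals[i]) & (c_vals[i] <= (1 + tol_c) * kc))
        if close.any():
            continue  # matches some accepted row: exclude it
        keep[i] = True
        kept_b[n_kept] = b_vals[i]
        kept_c[n_kept] = c_vals[i]
        n_kept += 1

    return df.loc[keep]


sample = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                       'b': [20.02, 19.96, 19.98, 20.10, 26.75, 56.12, 20.04, 56.24, 56.15],
                       'c': [10.12, 10.10, 123.54, 124.12, 245.12, 6.00, 10.11, 6.50, 128.67]})
print(drop_close_to_any_kept(sample))
```

On the first iteration the set of accepted rows is empty, so `close.any()` is False and the first row is always kept, matching the loop above.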