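I ran several hypothetical tests on insects. I want to drop the rows where the result_1 value is below 10, because I consider them unimportant, but I want to keep a single row with NaN in the result columns to show which test was performed on which insect.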
from pandas import Series, DataFrame
import numpy as np
A = Series(['A','A','B','B','B','C'])
B = Series(['ant','flea','flea','spider','spider','flea'])
C = Series([88,77,1,3,2,67])
D = Series(np.random.randn(6))
df = DataFrame({'test':A.values,'insect':B.values,
'result_1':C.values,'result_2':D.values},
columns=['test','insect','result_1','result_2'])
df
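So the original DataFrame looks like this:
Indices 2, 3 and 4 have result_1 values < 10, so I want to drop those rows, with one caveat: one row should remain (with NaN in both result columns) to show that test B was performed on the flea (index 2), and one row should remain to show that test B was indeed performed on the spider (indices 3 and 4: one must be dropped, and the other needs NaN inserted in the result columns).
So the resulting DataFrame should look like this:
Answer 0 (score: 2)
I think you can use: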
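# Set both result columns to NaN where result_1 is below 10
df.loc[df.result_1 < 10, ['result_1','result_2']] = np.nan
# Among the NaN rows, keep the first row per insect; after index alignment the repeats become all-NaN
df[df.result_1.isnull()] = df[df.result_1.isnull()].drop_duplicates(subset='insect')
# Drop the rows that are now entirely NaN
df = df.dropna(how='all')
print(df)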
  test  insect  result_1  result_2
0    A     ant      88.0 -0.037844
1    A    flea      77.0 -1.088879
2    B    flea       NaN       NaN
3    B  spider       NaN       NaN
5    C    flea      67.0  1.455632
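One caveat: the drop step above keys on insect alone. If the same insect could score below 10 under more than one test, deduplicating on the (test, insect) pair keeps one placeholder row per pair instead. The following is only a sketch under assumptions: the extra row at index 6, the fixed result_2 values, and the names `low`, `repeats` and `out` are made up for illustration, and result_1 is stored as floats so NaN can be assigned without changing the column dtype.

```python
import numpy as np
import pandas as pd

# Illustrative data: the question's six rows plus a hypothetical
# row 6 (test C on a spider, low result). result_2 values are fixed
# placeholders rather than random numbers.
df = pd.DataFrame({
    'test':     ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'insect':   ['ant', 'flea', 'flea', 'spider', 'spider', 'flea', 'spider'],
    'result_1': [88.0, 77.0, 1.0, 3.0, 2.0, 67.0, 4.0],
    'result_2': [0.5, -1.2, 0.3, 0.8, -0.4, 1.1, 0.2],
})

# Rows whose result_1 is below the threshold
low = df['result_1'] < 10
df.loc[low, ['result_1', 'result_2']] = np.nan

# Among the low rows only, flag repeats of the same (test, insect) pair
repeats = df[low].duplicated(subset=['test', 'insect'])

# Drop the flagged repeats; the first row of each pair stays as a placeholder
out = df.drop(repeats[repeats].index)
print(out)
```

In this sketch only index 4 is dropped. With subset='insect' the hypothetical row 6 would be dropped as well, because a spider already appears among the low rows under test B.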
Another solution is to find the relevant index labels and then drop those rows by index:
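# Rows whose result_1 is below 10
mask = df.result_1 < 10
df.loc[mask, ['result_1','result_2']] = np.nan
# Among those rows, flag every repeat of an insect after its first occurrence
a = df[mask].duplicated(subset='insect')
print(a)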
2 False
3 False
4 True
dtype: bool
# Index labels of the flagged repeats
a = a[a].index
df = df.drop(a)
print(df)
  test  insect  result_1  result_2
0    A     ant      88.0 -0.176274
1    A    flea      77.0 -0.123691
2    B    flea       NaN       NaN
3    B  spider       NaN       NaN
5    C    flea      67.0 -0.310655
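The result_2 values differ between the two printed outputs because the sample data uses np.random.randn without a seed, so each run draws new numbers. If reproducible sample data is wanted, a seeded generator is one option. This is a minimal sketch: the seed value 0 is arbitrary, and the names `rng`, `result_2` and `again` are made up here.

```python
import numpy as np
import pandas as pd

# A seeded generator; the seed value 0 is an arbitrary choice
rng = np.random.default_rng(0)
result_2 = pd.Series(rng.standard_normal(6))

# A fresh generator with the same seed yields the same six values,
# so the sample DataFrame can be rebuilt identically on every run
again = pd.Series(np.random.default_rng(0).standard_normal(6))
print(result_2.equals(again))  # True
```

Substituting `result_2` for the D series in the question's setup code would make the printed frames match from run to run.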