I have a pandas DataFrame that contains some duplicate rows, and I want to remove them, but subject to a condition:
wave num stlines fwhm EWs MeasredWave rv
0 4050.32 3 0.28269 0.07365 22.16080 4050.311360 0.639507
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962 0.501295
2 4208.98 6 0.49994 0.08220 43.74591 4208.974061 0.423016
3 4512.99 2 0.19428 0.09145 18.91216 4512.981301 0.577864
4 4512.99 2 0.21029 0.08860 19.83386 4512.981389 0.572018
5 4520.22 7 0.65300 0.11791 81.95775 4520.214169 0.386727
6 4520.22 4 0.66772 0.11591 82.38548 4520.212833 0.475334
7 4523.08 6 0.13789 0.11303 16.59034 4523.060226 1.310633
8 4523.40 1 0.41672 0.09892 43.87775 4523.390305 0.642545
9 5797.87 3 0.27062 0.15473 44.57125 5797.850820 0.991747
10 5797.87 4 0.28240 0.14991 45.06534 5797.848945 1.088698
import os
import pandas as pd

# path1 is not defined in the original snippet; assumed here to be the directory being listed
path1 = '/home/Desktop/computed_2d/'
dir1 = os.listdir(path1)
for filename in dir1:
    if filename.endswith('.ares'):
        df1 = pd.read_table(path1 + filename, skiprows=0, usecols=(0, 1, 2, 3, 4, 8, 10),
                            names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave', 'rv'],
                            delimiter=r'\s+')
        # dup_rows holds every row whose 'wave' value occurs more than once
        dup_rows = df1[df1.duplicated(['wave'], keep=False)]
computed_rv = 0.50641
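What I want to do now is remove the duplicate rows, keeping from each set of duplicates the row whose df1.rv value is nearly equal to computed_rv.
For example, of rows 1 and 2 I want to keep row 1, because its df1.rv value is the one nearly equal to computed_rv.
The values may lie below or above computed_rv, for example (0.34 and 0.30) or (0.99 and 1.8). In that case I want to keep the row whose df1.rv is closer to computed_rv, so here I would keep 0.34 and 0.99.
How can I do this?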
Answer 0 (score: 3)
IIUC:
query
computed_rv = 0.50641
tol = 0.01
df1.query('abs(rv - @computed_rv) < @tol')
wave num stlines fwhm EWs MeasredWave rv
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962 0.501295
np.isclose
computed_rv = 0.50641
tol = 0.01
df1[np.isclose(df1.rv, computed_rv, atol=tol)]
wave num stlines fwhm EWs MeasredWave rv
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962 0.501295
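One detail worth keeping in mind with np.isclose: besides atol it also applies a relative tolerance rtol (default 1e-05), so the test is |a - b| <= atol + rtol * |b|. The sketch below uses made-up rv values (the third one is hypothetical, chosen to sit just past tol) to show that passing rtol=0 gives a strictly absolute cut-off.

```python
import numpy as np

computed_rv = 0.50641
tol = 0.01

# First two values come from the question; the third is a hypothetical value
# lying 0.010003 above computed_rv, i.e. just outside tol.
rv = np.array([0.501295, 0.423016, 0.516413])

# Default rtol=1e-05 widens the window slightly beyond atol.
default_mask = np.isclose(rv, computed_rv, atol=tol)

# rtol=0 makes the comparison purely |rv - computed_rv| <= tol.
strict_mask = np.isclose(rv, computed_rv, rtol=0, atol=tol)

print(default_mask)
print(strict_mask)
```

For tolerances as coarse as 0.01 the difference rarely matters, but it explains why a borderline row can pass here and fail in the query version.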
computed_rv = 0.50641
tol = 0.01
df1[df1.rv.sub(computed_rv).abs().lt(tol)]
wave num stlines fwhm EWs MeasredWave rv
1 4208.98 5 0.48122 0.08765 44.90035 4208.972962 0.501295
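The three filters above keep only rows within tol of computed_rv across the whole frame, so rows with a unique wave can disappear as well. If the goal is instead to keep exactly one row per wave, namely the one whose rv is closest to computed_rv, a minimal sketch (not part of the original answer, and using only a subset of the question's rows and columns) could look like this:

```python
import pandas as pd

computed_rv = 0.50641

# A few rows from the question's sample, reduced to the columns needed here.
df1 = pd.DataFrame({
    'wave': [4050.32, 4208.98, 4208.98, 4512.99, 4512.99, 4520.22, 4520.22, 4523.08],
    'num': [3, 5, 6, 2, 2, 7, 4, 6],
    'rv': [0.639507, 0.501295, 0.423016, 0.577864, 0.572018, 0.386727, 0.475334, 1.310633],
})

# Absolute distance of each row's rv from the reference value.
dist = (df1['rv'] - computed_rv).abs()

# For every wave, the index label of the row with the smallest distance.
closest_idx = dist.groupby(df1['wave']).idxmin()

# Keep those rows and restore the original row order.
deduped = df1.loc[closest_idx].sort_index()
print(deduped)
```

Rows with a unique wave form a group of one and are therefore always kept, whatever their rv.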
Answer 1 (score: 1)
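You can decide on a threshold for how closely rv should match and exclude the rows that fall outside it. Here I used a band of 10% above and below computed_rv: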
computed_rv = 0.50641
threshold = 0.1*computed_rv
df1[df1.rv.ge(computed_rv - threshold) & df1.rv.le(computed_rv + threshold)]
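Applied to the whole frame, this band also drops rows whose wave is not duplicated at all (for instance wave 4050.32 with rv 0.639507). A possible variant, sketched here under the assumption that only duplicated waves should be filtered, combines the band with the duplicated mask from the question and uses Series.between for the range check:

```python
import pandas as pd

computed_rv = 0.50641
threshold = 0.1 * computed_rv

# A few rows from the question's sample, reduced to the columns needed here.
df1 = pd.DataFrame({
    'wave': [4050.32, 4208.98, 4208.98, 4523.08],
    'rv': [0.639507, 0.501295, 0.423016, 1.310633],
})

# True where rv lies inside [computed_rv - threshold, computed_rv + threshold].
in_band = df1['rv'].between(computed_rv - threshold, computed_rv + threshold)

# True for every row whose wave occurs more than once.
is_dup = df1.duplicated(['wave'], keep=False)

# Drop a row only if it is a duplicate AND falls outside the band.
result = df1[~is_dup | in_band]
print(result)
```

Note that if every duplicate of a given wave lies outside the band, all of them are dropped, and if several lie inside it, all of those are kept, so this does not guarantee one row per wave.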