Drop duplicate rows based on a column value using an almost-equal condition in pandas

Time: 2018-09-11 13:36:29

Tags: python-2.7 pandas dataframe duplicates

I have a pandas DataFrame that contains some duplicate rows. I want to remove them, but subject to a condition:

        wave  num  stlines     fwhm       EWs  MeasredWave        rv
0    4050.32    3  0.28269  0.07365  22.16080  4050.311360  0.639507
1    4208.98    5  0.48122  0.08765  44.90035  4208.972962  0.501295
2    4208.98    6  0.49994  0.08220  43.74591  4208.974061  0.423016
3    4512.99    2  0.19428  0.09145  18.91216  4512.981301  0.577864
4    4512.99    2  0.21029  0.08860  19.83386  4512.981389  0.572018
5    4520.22    7  0.65300  0.11791  81.95775  4520.214169  0.386727
6    4520.22    4  0.66772  0.11591  82.38548  4520.212833  0.475334
7    4523.08    6  0.13789  0.11303  16.59034  4523.060226  1.310633
8    4523.40    1  0.41672  0.09892  43.87775  4523.390305  0.642545
9    5797.87    3  0.27062  0.15473  44.57125  5797.850820  0.991747
10   5797.87    4  0.28240  0.14991  45.06534  5797.848945  1.088698

import os
import pandas as pd

# path1 was undefined in the original snippet; assumed to be the listed directory
path1 = '/home/Desktop/computed_2d/'
dir1 = os.listdir(path1)
for filename in dir1:
    if filename.endswith('.ares'):
        df1 = pd.read_table(path1 + filename, skiprows=0, usecols=(0, 1, 2, 3, 4, 8, 10),
                            names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave', 'rv'],
                            delimiter=r'\s+')

        # dup_rows holds every row whose 'wave' value occurs more than once
        dup_rows = df1[df1.duplicated(['wave'], keep=False)]

        computed_rv = 0.50641

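Now, among the duplicate rows, I want to keep the one whose df1.rv value is almost equal to computed_rv and drop the others.

For example, of rows 1 and 2 I want to keep row 1, because its df1.rv value is almost equal to computed_rv.

The values may lie below or above computed_rv, for example (0.34 and 0.30) or (0.99 and 1.8); in that case I want to keep the row whose df1.rv is closest to computed_rv, so here I would keep 0.34 and 0.99.

How can I do this?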

2 answers:

Answer 0: (score: 3)

IIUC:

query

computed_rv = 0.50641
tol = 0.01

df1.query('abs(rv - @computed_rv) < @tol')

      wave  num  stlines     fwhm       EWs  MeasredWave        rv
1  4208.98    5  0.48122  0.08765  44.90035  4208.972962  0.501295

is_close

computed_rv = 0.50641
tol = 0.01

df1[np.isclose(df1.rv, computed_rv, atol=tol)]

      wave  num  stlines     fwhm       EWs  MeasredWave        rv
1  4208.98    5  0.48122  0.08765  44.90035  4208.972962  0.501295

pandas

computed_rv = 0.50641
tol = 0.01

df1[df1.rv.sub(computed_rv).abs().lt(tol)]

      wave  num  stlines     fwhm       EWs  MeasredWave        rv
1  4208.98    5  0.48122  0.08765  44.90035  4208.972962  0.501295
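
The three filters above keep only the rows that lie within `tol` of `computed_rv`, so they also drop rows that were never duplicated (for example `wave == 4050.32`). If the goal is instead to keep exactly one row per `wave`, namely the one whose `rv` is closest to `computed_rv`, something along these lines might work. This is only a sketch and not part of the original answer; the frame is rebuilt from the question's sample with just the `wave` and `rv` columns, and the names `dist`, `closest_idx` and `deduped` are made up for illustration.

```python
import pandas as pd

computed_rv = 0.50641

# 'wave' and 'rv' columns copied from the sample data in the question.
df1 = pd.DataFrame({
    'wave': [4050.32, 4208.98, 4208.98, 4512.99, 4512.99, 4520.22,
             4520.22, 4523.08, 4523.40, 5797.87, 5797.87],
    'rv': [0.639507, 0.501295, 0.423016, 0.577864, 0.572018, 0.386727,
           0.475334, 1.310633, 0.642545, 0.991747, 1.088698],
})

# Absolute distance of every rv from the reference value.
dist = (df1['rv'] - computed_rv).abs()

# For each wave, the index label of the row with the smallest distance.
closest_idx = dist.groupby(df1['wave']).idxmin()

# Keep those rows and restore the original row order.
deduped = df1.loc[closest_idx].sort_index()
print(deduped)
```

Rows whose `wave` occurs only once are their own closest match, so they survive unchanged; no tolerance is needed with this approach.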

Answer 1: (score: 1)

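You can decide on a threshold for how closely rv must match and exclude the rows that fall outside it. Here I used a window of 10% above and below computed_rv: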

computed_rv = 0.50641
threshold = 0.1 * computed_rv  # accept rv within 10% of computed_rv on either side
df1[df1.rv.ge(computed_rv - threshold) & df1.rv.le(computed_rv + threshold)]
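
To see what a 10% window does on the question's sample, here is a small self-contained sketch. It is an illustration rather than the answerer's own code: it uses `Series.between`, which is inclusive on both ends by default and so should select the same rows as the `ge`/`le` pair, and it rebuilds only the `wave` and `rv` columns. The names `mask` and `within` are made up for the example.

```python
import pandas as pd

computed_rv = 0.50641
threshold = 0.1 * computed_rv  # 10% of the reference value

# 'wave' and 'rv' columns copied from the sample data in the question.
df1 = pd.DataFrame({
    'wave': [4050.32, 4208.98, 4208.98, 4512.99, 4512.99, 4520.22,
             4520.22, 4523.08, 4523.40, 5797.87, 5797.87],
    'rv': [0.639507, 0.501295, 0.423016, 0.577864, 0.572018, 0.386727,
           0.475334, 1.310633, 0.642545, 0.991747, 1.088698],
})

# True where rv lies inside [computed_rv - threshold, computed_rv + threshold].
mask = df1['rv'].between(computed_rv - threshold, computed_rv + threshold)
within = df1[mask]
print(within)
```

On this sample only rows 1 and 6 fall inside the window, so every other `wave` disappears entirely. Whether that is acceptable depends on whether unmatched lines should be kept.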