删除符合条件

时间:2017-06-12 07:55:47

标签: python pandas numpy

我有一个带有时间序列(第1列)的数据框和一个带有值(第2列)的列,它们是时间序列的每个子系列的特征。 如何删除符合条件的子系列?

图片说明了我想做的事情。我想删除橙色行: enter image description here

我尝试创建一个循环来创建一个额外的列,其中包含要删除哪些行的功能,但此解决方案的计算成本非常高(我在列中有10mln记录)。代码(慢速解决方案):

import numpy as np
import pandas as pd 

# sample data (smaller than actual df)
# length of df = 100; should be 10000000 in the actual data frame
time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0,max_num,100)
to_remove= 100*[np.nan]

data_dict = {'time_ser':time_ser, 
             'distance':distance,
             'to_remove': to_remove
             }

df = pd.DataFrame(data_dict)

subser_size = 3
maxdist = 18


# loop which creates an additional column which indicates which indexes should be removed.
# Takes first value in a subseries and checks if it meets the condition. 
# If it does, all values in subseries (i.e. rows) should be removed ('wrong'). 

for i,d in zip(range(len(df)), df.distance):
    if d >= maxdist:
        df.to_remove.iloc[i:i+subser_size] = 'wrong'  
    else:
        df.to_remove.iloc[i] ='good'

1 个答案:

答案 0 :(得分:1)

您可以使用列表推导来numpy.concatenate创建索引数组,并使用numpy.unique删除重复项。

然后使用drop或需要新列loc

np.random.seed(123)
time_ser = 100*[25]
max_num = 20
distance = np.random.uniform(0,max_num,100)
to_remove= 100*[np.nan]

data_dict = {'time_ser':time_ser, 
             'distance':distance,
             'to_remove': to_remove
             }

df = pd.DataFrame(data_dict)
print (df)
     distance  time_ser  to_remove
0   13.929384        25        NaN
1    5.722787        25        NaN
2    4.537029        25        NaN
3   11.026295        25        NaN
4   14.389379        25        NaN
5    8.462129        25        NaN
6   19.615284        25        NaN
7   13.696595        25        NaN
8    9.618638        25        NaN
9    7.842350        25        NaN
10   6.863560        25        NaN
11  14.580994        25        NaN
subser_size = 3
maxdist = 18

print (df.index[df['distance'] >= maxdist])
Int64Index([6, 38, 47, 84, 91], dtype='int64')

arr = [np.arange(i, min(i+subser_size,len(df))) for i in df.index[df['distance'] >= maxdist]]
idx = np.unique(np.concatenate(arr))
print (idx)
[ 6  7  8 38 39 40 47 48 49 84 85 86 91 92 93]

df = df.drop(idx)
print (df)
     distance  time_ser  to_remove
0   13.929384        25        NaN
1    5.722787        25        NaN
2    4.537029        25        NaN
3   11.026295        25        NaN
4   14.389379        25        NaN
5    8.462129        25        NaN
9    7.842350        25        NaN
10   6.863560        25        NaN
11  14.580994        25        NaN
...
...

如果列中需要值:

df['to_remove'] = 'good'
df.loc[idx, 'to_remove'] = 'wrong' 
print (df)
     distance  time_ser to_remove
0   13.929384        25      good
1    5.722787        25      good
2    4.537029        25      good
3   11.026295        25      good
4   14.389379        25      good
5    8.462129        25      good
6   19.615284        25     wrong
7   13.696595        25     wrong
8    9.618638        25     wrong
9    7.842350        25      good
10   6.863560        25      good
11  14.580994        25      good