How to exclude products based on their status in Python and Pandas?

Time: 2017-11-22 15:11:47

Tags: python pandas dataframe filter

Suppose I have the following Pandas DataFrame with product descriptions. How can I exclude every product (id) that has any status other than 4, 5, or 6?

Input:

    id | description | status
    -------------------------
    1  | world1      | 1
    1  | world2      | 4
    1  | world3      | 1
    1  | world4      | 4
    1  | world5      | 4
    1  | world6      | 4
    1  | world7      | 1
    1  | world8      | 4
    1  | world9      | 4
    1  | world10     | 4
    1  | world11     | 4
    1  | world12     | 4
    1  | world13     | 4
    1  | world14     | 4
    1  | world15     | 1
    2  | world1      | 4
    2  | world2      | 4
    2  | world3      | 5
    2  | world15     | 6
    2  | world8      | 6
    2  | world4      | 5
    2  | world7      | 5

Output:

    id | description | status
    -------------------------
    2  | world1      | 4
    2  | world2      | 4
    2  | world3      | 5
    2  | world15     | 6
    2  | world8      | 6
    2  | world4      | 5
    2  | world7      | 5

2 answers:

Answer 0 (score: 1)

First collect every id that has a status not in the list L, then filter out all rows whose id appears in that collection (a):

L = [4,5,6]
# ids that have at least one status outside L
a = df.loc[~df['status'].isin(L), 'id']
# keep only the rows whose id is not among them
df = df[~df['id'].isin(a)]
print (df)
    id description  status
15   2      world1       4
16   2      world2       4
17   2      world3       5
18   2     world15       6
19   2      world8       6
20   2      world4       5
21   2      world7       5

Details:

print (a)
0     1
2     1
6     1
14    1
Name: id, dtype: int64
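For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data from the question and wraps the two steps in a helper. The function name `keep_ids_with_only` and its parameter names are made up for illustration and are not part of pandas or of the answer above.

```python
import pandas as pd

# Rebuild the question's sample data: id 1 has some status-1 rows, id 2 does not.
df = pd.DataFrame({
    "id": [1] * 15 + [2] * 7,
    "description": [f"world{i}" for i in range(1, 16)]
                   + ["world1", "world2", "world3", "world15",
                      "world8", "world4", "world7"],
    "status": [1, 4, 1, 4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 1,
               4, 4, 5, 6, 6, 5, 5],
})


def keep_ids_with_only(frame, allowed, id_col="id", status_col="status"):
    """Return the rows whose id never appears with a status outside `allowed`.

    Hypothetical helper, named here only for this example.
    """
    # ids that have at least one disallowed status
    bad_ids = frame.loc[~frame[status_col].isin(allowed), id_col].unique()
    # drop every row that belongs to one of those ids
    return frame[~frame[id_col].isin(bad_ids)]


result = keep_ids_with_only(df, [4, 5, 6])
print(result)
```

Calling `.unique()` on the bad ids is optional; it only shortens the lookup list passed to the second `isin`.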

Timings:

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000

L = np.random.randint(1000,size=N)
df = pd.DataFrame({'status': np.random.choice([4,5,6,7], p = (0.3,0.3,0.39,0.01), size=N),
                   'id':np.random.choice(L, N),
                   'description':np.random.choice(L, N)})
print (df)


L = [4,5,6]

In [461]: %%timeit 
     ...: a = df.loc[~df['status'].isin(L), 'id']
     ...: df[~df['id'].isin(a)]
     ...: 
     ...: 
100 loops, best of 3: 1.91 ms per loop

#Wen's solution
In [462]: %%timeit
     ...: df['status']=df['status'].mask(~df['status'].isin([4,5,6]))
     ...: df.groupby('id').filter(lambda x : ~x.status.isnull().any() )
     ...: 
10 loops, best of 3: 111 ms per loop
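The comparison above relies on IPython's `%%timeit`. Outside IPython, something like the following sketch with the standard `timeit` module should give a comparable picture. The function names `isin_filter` and `groupby_filter` are hypothetical, the random data is generated with `np.random.default_rng` rather than the seeded calls above (so the frame is not identical), and the group-by variant works on a copy so the original `status` column is left alone. Absolute numbers will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

ALLOWED = [4, 5, 6]


def isin_filter(frame):
    # two vectorized boolean-mask passes, no per-group Python calls
    bad_ids = frame.loc[~frame["status"].isin(ALLOWED), "id"]
    return frame[~frame["id"].isin(bad_ids)]


def groupby_filter(frame):
    # blank out disallowed statuses on a copy, then keep groups without NaN
    masked = frame.assign(
        status=frame["status"].where(frame["status"].isin(ALLOWED))
    )
    return masked.groupby("id").filter(
        lambda g: not g["status"].isnull().any()
    )


# Synthetic data in the same spirit as the benchmark above (assumed sizes).
rng = np.random.default_rng(123)
n = 100_000
df = pd.DataFrame({
    "status": rng.choice([4, 5, 6, 7], p=(0.3, 0.3, 0.39, 0.01), size=n),
    "id": rng.integers(1000, size=n),
    "description": rng.integers(1000, size=n),
})

# Both approaches should keep the same rows.
same_rows = isin_filter(df).index.equals(groupby_filter(df).index)
print("same rows:", same_rows)

for fn in (isin_filter, groupby_filter):
    seconds = timeit.timeit(lambda: fn(df), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1000:.2f} ms per call")
```

The gap comes from `groupby(...).filter` calling a Python function once per id, while the `isin` version stays vectorized.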

Answer 1 (score: 1)

Two steps.

First, use mask to replace every status outside [4, 5, 6] with NaN (this converts the column to float):

df['status']=df['status'].mask(~df['status'].isin([4,5,6]))

Second, use groupby + filter to keep only the ids that have no NaN status:

df.groupby('id').filter(lambda x : ~x.status.isnull().any() )
Out[44]: 
    id description  status
15   2      world1     4.0
16   2      world2     4.0
17   2      world3     5.0
18   2     world15     6.0
19   2      world8     6.0
20   2      world4     5.0
21   2      world7     5.0
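Note that the output shows `4.0`, `5.0`, `6.0` because masking introduces NaN and turns `status` into a float column. If the original integer column should stay untouched, one possible variant (a sketch, not part of this answer) is to build a per-row boolean and broadcast a per-id "all allowed" flag with `groupby(...).transform('all')`. The small frame below is a shortened, made-up version of the question's data.

```python
import pandas as pd

# Shortened stand-in for the question's data (assumed for this example).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "description": ["world1", "world2", "world3",
                    "world1", "world3", "world15"],
    "status": [1, 4, 1, 4, 5, 6],
})

# True for rows whose own status is allowed
allowed = df["status"].isin([4, 5, 6])
# Broadcast "every row of this id is allowed" back onto each row
keep = allowed.groupby(df["id"]).transform("all")

result = df[keep]
print(result)
print(result["status"].dtype)
```

Because `df` is never modified, the `status` column keeps its integer dtype in the result.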