Pandas: make a column boolean and drop rows that are not

Time: 2016-04-06 00:56:08

Tags: python pandas

This is what my pandas DataFrame looks like:

   id       text          country   datetime
0   1      hello,bye         USA    3/20/2016
1   0      good morning      UK     3/21/2016
2   x      wrong             USA    3/21/2016

I want to make only the id column boolean, and drop the row if its value is not boolean.

I tried

df=df[df['id'].bool()]

but got ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

2 Answers:

Answer 0: (score: 1)

IIUC, you can try converting the id column with to_numeric and then comparing it with 1:

print(pd.to_numeric(df.id, errors='coerce') == 1)
0     True
1    False
2    False
Name: id, dtype: bool

print(df[pd.to_numeric(df.id, errors='coerce') == 1])
  id       text country   datetime
0  1  hello bye     USA  3/20/2016
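
For anyone who wants to reproduce this, here is a minimal sketch of the comparison step. The frame is rebuilt by hand from the question, and storing id as strings is an assumption (it is consistent with the isin(['0','1']) call used further down in this answer). The variable names numeric_id and is_one are only illustrative.

```python
import pandas as pd

# Frame rebuilt from the question; id is assumed to hold strings.
df = pd.DataFrame({
    "id": ["1", "0", "x"],
    "text": ["hello,bye", "good morning", "wrong"],
    "country": ["USA", "UK", "USA"],
    "datetime": ["3/20/2016", "3/21/2016", "3/21/2016"],
})

# errors="coerce" turns anything that cannot be parsed ('x') into NaN
# instead of raising.
numeric_id = pd.to_numeric(df["id"], errors="coerce")

# NaN == 1 is False, so the unparseable row drops out together with id 0.
is_one = numeric_id == 1

print(numeric_id.tolist())
print(df[is_one])
```

Note that this mask keeps only the rows whose id equals 1, so the row with id 0 is filtered out as well.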

If you need to drop the rows where the id column is not 0 or 1, use isin:

print(df.id.isin(['0','1']))
0     True
1     True
2    False
Name: id, dtype: bool

print(df[df.id.isin(['0','1'])])
  id          text country   datetime
0  1     hello bye     USA  3/20/2016
1  0  good morning      UK  3/21/2016
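
The isin(['0','1']) filter relies on the ids being stored as clean strings. A minimal sketch, assuming a hypothetical messier column that mixes integers and padded strings (this data is invented for illustration and is not from the question), of normalizing before the membership test:

```python
import pandas as pd

# Hypothetical messy column: a string, an int, a padded string, and junk.
df = pd.DataFrame({
    "id": ["1", 0, " 1", "x"],
    "text": ["a", "b", "c", "d"],
})

# Cast everything to str and strip whitespace so that 0 and ' 1'
# compare equal to '0' and '1'.
normalized = df["id"].astype(str).str.strip()

mask = normalized.isin(["0", "1"])
clean = df[mask]

print(clean)
```

Without the normalization step, isin(['0','1']) on this column would match only the first row.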

Or use to_numeric with notnull:

print(pd.to_numeric(df.id, errors='coerce').notnull())
0     True
1     True
2    False
Name: id, dtype: bool

print(df[pd.to_numeric(df.id, errors='coerce').notnull()])
  id          text country   datetime
0  1     hello bye     USA  3/20/2016
1  0  good morning      UK  3/21/2016
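
The two filters agree on the sample data but are not equivalent in general. A small sketch, assuming an extra id value '2' that does not appear in the question, of where they differ:

```python
import pandas as pd

# '2' is a hypothetical extra value added to show the difference.
ids = pd.Series(["1", "0", "x", "2"], name="id")

# Keeps every value that parses as a number, including '2'.
numeric_mask = pd.to_numeric(ids, errors="coerce").notnull()

# Keeps only the two values that represent a boolean.
strict_mask = ids.isin(["0", "1"])

print(ids[numeric_mask].tolist())
print(ids[strict_mask].tolist())
```

If the column can contain other numbers, the isin version is probably closer to "drop the row if the value is not boolean".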

Finally, you can convert the id column to bool with replace or a double astype:

print(df.loc[df.id.isin(['0','1']),'id'].replace({'0': False, '1': True}))
0     True
1    False
Name: id, dtype: bool

print(df.loc[df.id.isin(['0','1']),'id'].astype(int).astype(bool))
0     True
1    False
Name: id, dtype: bool

print(df.loc[pd.to_numeric(df.id, errors='coerce').notnull(),'id'].astype(int).astype(bool))
0     True
1    False
Name: id, dtype: bool

EDIT:

Timings, if the values to convert to bool are only 0 and 1.

The best is map with numpy.in1d:

#len(df) = 30k
df = pd.concat([df]*10000).reset_index(drop=True)

In [628]: %timeit df.loc[np.in1d(df['id'], ['0','1']),'id'].map({'0': False, '1': True})
100 loops, best of 3: 2.19 ms per loop

In [629]: %timeit df.loc[np.in1d(df['id'], ['0','1']),'id'].replace({'0': False, '1': True})
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 4.72 ms per loop

In [630]: %timeit df.loc[df['id'].isin(['0','1']),'id'].map({'0': False, '1': True})
100 loops, best of 3: 2.78 ms per loop

In [631]: %timeit df.loc[df['id'].str.contains('0|1'),'id'].map({'0': False, '1': True})
10 loops, best of 3: 20 ms per loop

In [632]: %timeit df.loc[df['id'].isin(['0','1']),'id'].astype(int).astype(bool)
100 loops, best of 3: 9.5 ms per loop
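
The timed expressions only return the converted Series. A minimal end-to-end sketch, assuming the frame from the question with id stored as strings, that filters the rows and writes the boolean column back. It uses np.isin because np.in1d is deprecated as of NumPy 2.0; that swap is mine and was not part of the timings above.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question; id is assumed to hold strings.
df = pd.DataFrame({
    "id": ["1", "0", "x"],
    "text": ["hello,bye", "good morning", "wrong"],
    "country": ["USA", "UK", "USA"],
    "datetime": ["3/20/2016", "3/21/2016", "3/21/2016"],
})

# Membership test on the raw values; np.isin replaces np.in1d here.
mask = np.isin(df["id"].to_numpy(), ["0", "1"])

# Copy the kept rows so the assignment below does not touch df.
result = df.loc[mask].copy()

# Map the two allowed strings to booleans and make the dtype explicit.
result["id"] = result["id"].map({"0": False, "1": True}).astype(bool)

print(result)
print(result.dtypes)
```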

Answer 1: (score: 0)

You can use str.isdigit to check whether your id column contains only digits, then convert it to numeric and then to boolean:

In [14]: df['id'].str.isdigit()
Out[14]:
0     True
1     True
2    False
Name: id, dtype: bool

Subset with digits only:

In [15]: df.loc[df['id'].str.isdigit(), 'id']
Out[15]:
0    1
1    0
Name: id, dtype: object

Convert to bool:

In [17]: df.loc[df['id'].str.isdigit(), 'id'].astype(int).astype(bool)
Out[17]:
0     True
1    False
Name: id, dtype: bool

Comparison with pd.to_numeric:

In [18]: %timeit pd.to_numeric(df.id, errors='coerce').notnull()
10000 loops, best of 3: 178 us per loop

In [19]: %timeit df['id'].str.isdigit()
10000 loops, best of 3: 128 us per loop
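
To re-run this comparison outside IPython, a rough sketch with the standard timeit module follows. The frame size and repeat count are arbitrary assumptions, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumed test data: the three ids from the question, repeated.
df = pd.DataFrame({"id": ["1", "0", "x"] * 1000})

def with_to_numeric():
    return pd.to_numeric(df["id"], errors="coerce").notnull()

def with_isdigit():
    return df["id"].str.isdigit()

# Total seconds for 20 calls of each variant.
t_numeric = timeit.timeit(with_to_numeric, number=20)
t_isdigit = timeit.timeit(with_isdigit, number=20)

print("to_numeric + notnull:", t_numeric)
print("str.isdigit:         ", t_isdigit)
```

On this data both functions produce the same mask, so the timing is a like-for-like comparison.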