Question

我有一个值列表（事先不知道，在Python列表中），我的Panda DataFrame中的列不能包含所有行。

网络上的所有食谱（如this one）都显示如何只使用一个要排除的值，但我有多个要排除的值。我该怎么做？

请注意，我无法在我的代码中硬编码要排除的值。

谢谢！

Answer 1

您可以执行否定isin()索引：

In [57]: df
Out[57]:
   a  b  c
0  1  2  2
1  1  7  0
2  3  7  1
3  3  2  7
4  1  3  1
5  3  4  2
6  0  7  1
7  5  4  3
8  6  1  0
9  3  2  0

In [58]: my_list = [1, 7, 8]

In [59]: df.loc[~df.b.isin(my_list)]
Out[59]:
   a  b  c
0  1  2  2
3  3  2  7
4  1  3  1
5  3  4  2
7  5  4  3
9  3  2  0

或使用query()功能：

In [60]: df.query('@my_list not in b')
Out[60]:
   a  b  c
0  1  2  2
3  3  2  7
4  1  3  1
5  3  4  2
7  5  4  3
9  3  2  0

Answer 2

你也可以使用np.in1d。来自https://stackoverflow.com/a/38083418/2336654

对于您的用例：

df[~np.in1d(df.b, my_list)]

示范

from string import ascii_letters, ascii_lowercase, ascii_uppercase
df = pd.DataFrame({'lower': list(ascii_lowercase), 'upper': list(ascii_uppercase)}).head(6)
exclude = list(ascii_uppercase[:6:2])

print df

  lower upper
0     a     A
1     b     B
2     c     C
3     d     D
4     e     E
5     f     F

print exclude

['A', 'C', 'E']

print df[~np.in1d(df.upper, exclude)]

  lower upper
1     b     B
3     d     D
5     f     F

时序

各种方法

包含3个项目的130万行

130万行，不包括12项

结论

显然，isin和query在这种情况下会更好地扩展。

Answer 3

我决定为不同的方法和不同的dtypes添加另一个时间答案 - 对于一个答案来说太长了......

针对以下dtypes的针对1M行DF的计时：int32，int64，float64，object（string）：

In [207]: result
Out[207]:
                              int32  int64  float  string
method
df[~np.in1d(df.col, excl)]      249    271    307    2420
df[~df.col.isin(excl)]          414    210    514     189
df.ix[~df.col.isin(excl)]       399    212    553     189
df.query('@excl not in col')    415    228    563     206

In [208]: result.T
Out[208]:
method  df[~np.in1d(df.col, excl)]  df[~df.col.isin(excl)]  df.ix[~df.col.isin(excl)]  df.query('@excl not in col')
int32                          249                     414                        399                           415
int64                          271                     210                        212                           228
float                          307                     514                        553                           563
string                        2420                     189                        189                           206

原始结果：

<强> INT32：

In [159]: %timeit df[~np.in1d(df.int32, exclude_int32)]
1 loop, best of 3: 249 ms per loop

In [160]: %timeit df[~df.int32.isin(exclude_int32)]
1 loop, best of 3: 414 ms per loop

In [161]: %timeit df.ix[~df.int32.isin(exclude_int32)]
1 loop, best of 3: 399 ms per loop

In [162]: %timeit df.query('@exclude_int32 not in int32')
1 loop, best of 3: 415 ms per loop

<强>的Int64：

In [163]: %timeit df[~np.in1d(df.int64, exclude_int64)]
1 loop, best of 3: 271 ms per loop

In [164]: %timeit df[~df.int64.isin(exclude_int64)]
1 loop, best of 3: 210 ms per loop

In [165]: %timeit df.ix[~df.int64.isin(exclude_int64)]
1 loop, best of 3: 212 ms per loop

In [166]: %timeit df.query('@exclude_int64 not in int64')
1 loop, best of 3: 228 ms per loop

<强> float64：

In [167]: %timeit df[~np.in1d(df.float, exclude_float)]
1 loop, best of 3: 307 ms per loop

In [168]: %timeit df[~df.float.isin(exclude_float)]
1 loop, best of 3: 514 ms per loop

In [169]: %timeit df.ix[~df.float.isin(exclude_float)]
1 loop, best of 3: 553 ms per loop

In [170]: %timeit df.query('@exclude_float not in float')
1 loop, best of 3: 563 ms per loop

对象/字符串：

In [171]: %timeit df[~np.in1d(df.string, exclude_str)]
1 loop, best of 3: 2.42 s per loop

In [172]: %timeit df[~df.string.isin(exclude_str)]
10 loops, best of 3: 189 ms per loop

In [173]: %timeit df.ix[~df.string.isin(exclude_str)]
10 loops, best of 3: 189 ms per loop

In [174]: %timeit df.query('@exclude_str not in string')
1 loop, best of 3: 206 ms per loop

<强>结论：

np.in1d() - （int32和float64）搜索获胜，但约。搜索字符串时，10次较慢（与其他人相比），因此不要将其用于object（字符串）和int64 dtypes！

<强>设定：

df = pd.DataFrame({
    'int32':    np.random.randint(0, 10**6, 10),
    'int64':    np.random.randint(10**7, 10**9, 10).astype(np.int64)*10,
    'float':    np.random.rand(10),
    'string':   np.random.choice([c*10 for c in string.ascii_uppercase], 10),
    })


df = pd.concat([df] * 10**5, ignore_index=True)

exclude_str = np.random.choice([c*10 for c in string.ascii_uppercase], 100).tolist()
exclude_int32 = np.random.randint(0, 10**6, 100).tolist()
exclude_int64 = (np.random.randint(10**7, 10**9, 100).astype(np.int64)*10).tolist()
exclude_float = np.random.rand(100)


In [146]: df.shape
Out[146]: (1000000, 4)

In [147]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
float     1000000 non-null float64
int32     1000000 non-null int32
int64     1000000 non-null int64
string    1000000 non-null object
dtypes: float64(1), int32(1), int64(1), object(1)
memory usage: 26.7+ MB

In [148]: df.head()
Out[148]:
      float   int32       int64      string
0  0.221662  283447  6849265910  NNNNNNNNNN
1  0.276834  455464  8785039710  AAAAAAAAAA
2  0.517846  618887  8653293710  YYYYYYYYYY
3  0.318897  363191  2223601320  PPPPPPPPPP
4  0.323926  777875  5357201380  QQQQQQQQQQ

根据列值删除Pandas中的DataFrame行 - 要删除的多个值

3 个答案:

示范

时序

包含3个项目的130万行

130万行，不包括12项

结论