Question

Consider the DataFrame df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df

col_list = list('bcd')

df[col_list]

raises an error:

KeyError: "['d'] not in index"

How can I select as many of the listed columns as actually exist in df?

Answer 1

How about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)]
Out[69]:
     b    c
0  1.0  1.0
1  1.0  1.0

In [70]: df.columns
Out[70]: Index(['a', 'b', 'c'], dtype='object')  # <---------- Index
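As a minimal sketch of the same idea, the snippet below wraps the intersection in a small helper. The name `existing_columns` is hypothetical (it is not a pandas function); it only combines `Index.intersection` with ordinary column selection.

```python
import numpy as np
import pandas as pd


def existing_columns(frame, wanted):
    """Return the columns of `frame` that also appear in `wanted` (hypothetical helper)."""
    # Index.intersection accepts any list-like and returns a new Index
    common = frame.columns.intersection(wanted)
    return frame[common]


df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
subset = existing_columns(df, list('bcd'))
print(subset.columns.tolist())
```

Names in `wanted` that are missing from the frame, such as 'd' here, are dropped silently rather than raising a KeyError.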

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True)

In [22]: df_.shape
Out[22]: (200000, 3)

In [23]: df.columns
Out[23]: Index(['a', 'b', 'c'], dtype='object')

In [24]: col_list = list('bcd')

In [28]: %timeit df_[df_.columns.intersection(col_list)]
100 loops, best of 3: 6.24 ms per loop

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]]
100 loops, best of 3: 5.69 ms per loop

Let's test on the transposed DF (3 rows, 200K columns):

In [30]: t = df_.T

In [31]: t.shape
Out[31]: (3, 200000)

In [32]: t
Out[32]:
   0       1       2       3       4        ...    199995  199996  199997  199998  199999
a     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0
b     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0
c     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0

[3 rows x 200000 columns]

In [33]: col_list=[-10, -20, 10, 20, 100]

In [34]: %timeit t[t.columns.intersection(col_list)]
10 loops, best of 3: 52.8 ms per loop

In [35]: %timeit t[[col for col in col_list if col in t.columns]]
10 loops, best of 3: 103 ms per loop

Conclusion: the list comprehension almost always wins for smaller lists, while the Pandas / NumPy approach wins for bigger data sets...
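Timings like these depend on the machine and the pandas version, so it may be worth re-running the comparison locally. The following is a rough sketch using the standard `timeit` module instead of `%timeit`. The frame size (20,000 columns) and the function names `via_intersection` and `via_comprehension` are assumptions chosen for illustration.

```python
import timeit

import numpy as np
import pandas as pd

# Wide frame: 3 rows, 20,000 integer-labelled columns (smaller than the 200K case)
wide = pd.DataFrame(np.ones((3, 20_000)))
col_list = [-10, -20, 10, 20, 100]


def via_intersection():
    return wide[wide.columns.intersection(col_list)]


def via_comprehension():
    return wide[[col for col in col_list if col in wide.columns]]


for func in (via_intersection, via_comprehension):
    # Best of 3 repeats, 10 calls per repeat, reported per call
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1e3:.3f} ms per call")
```

No expected numbers are given here on purpose, since the ranking of the two approaches can shift with the size of the column index.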

Answer 2

How about:

df[[col for col in list('bcd') if col in df.columns]]

This yields:

     b    c
0  1.0  1.0
1  1.0  1.0
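One property of the comprehension that may matter in practice is that it keeps the order in which the names were requested, and the same pattern can report which names were missing. A small sketch, where the variable names `present` and `missing` are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('cbd')  # note: 'c' is requested before 'b'

# Keep only the requested names that exist, in the order they were requested
present = [col for col in col_list if col in df.columns]
# Names that were requested but are absent from the frame
missing = [col for col in col_list if col not in df.columns]

result = df[present]
print(present, missing)
```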

Answer 3

Index objects support isin:

In [4]:
col_list = list('bcd')
df.loc[:, df.columns.isin(col_list)]

Out[4]:
   b  c
0  1  1
1  1  1

So this generates a boolean mask over the existing columns, marking those that appear in the passed-in list.
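To make the mask visible, here is a short sketch that builds it separately before selecting. It uses `.loc`, which accepts a boolean array on the column axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

mask = df.columns.isin(col_list)  # NumPy boolean array, one entry per column
selected = df.loc[:, mask]        # keep only the columns where the mask is True
print(mask.tolist())
```

Because the mask is aligned with `df.columns`, the selected columns come back in the frame's own order, not in the order of `col_list`.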

Timings:

In [5]:
df_ = pd.concat([df] * 10**5, ignore_index=True)
%timeit df_[df_.columns.intersection(col_list)]
%timeit df_[[col for col in col_list if col in df_.columns]]
%timeit df_.ix[:,df_.columns.isin(col_list)]  # .ix was removed in pandas 1.0; use .loc on current versions

100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 18.6 ms per loop
10 loops, best of 3: 26.6 ms per loop

This is the slowest of the three methods, but it takes fewer characters and may be easier to understand.

Get a slice of a DataFrame from a list of column names where not all of the columns are in the DataFrame
