Consider the DataFrame df:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df
col_list = list('bcd')
df[col_list]
This raises an error:
KeyError: "['d'] not in index"
How can I select as many of the requested columns as actually exist in df?
Answer 0 (score: 5)
How about using Index.intersection()?
In [69]: df[df.columns.intersection(col_list)]
Out[69]:
b c
0 1.0 1.0
1 1.0 1.0
In [70]: df.columns
Out[70]: Index(['a', 'b', 'c'], dtype='object') # <---------- Index
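The same Index set operations can also report which requested columns were dropped. The following is only a minimal sketch, assuming the df and col_list from the question; the names present and missing are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Labels found both in the frame and in the requested list.
present = df.columns.intersection(col_list)
# Requested labels that the frame does not have.
missing = pd.Index(col_list).difference(df.columns)

subset = df[present]
print(list(subset.columns))  # ['b', 'c']
print(list(missing))         # ['d']
```

Logging missing before selecting makes it visible when a requested column is silently skipped.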
Timing:
In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True)
In [22]: df_.shape
Out[22]: (200000, 3)
In [23]: df.columns
Out[23]: Index(['a', 'b', 'c'], dtype='object')
In [24]: col_list = list('bcd')
In [28]: %timeit df_[df_.columns.intersection(col_list)]
100 loops, best of 3: 6.24 ms per loop
In [29]: %timeit df_[[col for col in col_list if col in df_.columns]]
100 loops, best of 3: 5.69 ms per loop
Let's test on the transposed DataFrame (3 rows, 200K columns):
In [30]: t = df_.T
In [31]: t.shape
Out[31]: (3, 200000)
In [32]: t
Out[32]:
0 1 2 3 4 ... 199995 199996 199997 199998 199999
a 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
b 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
c 1.0 1.0 1.0 1.0 1.0 ... 1.0 1.0 1.0 1.0 1.0
[3 rows x 200000 columns]
In [33]: col_list=[-10, -20, 10, 20, 100]
In [34]: %timeit t[t.columns.intersection(col_list)]
10 loops, best of 3: 52.8 ms per loop
In [35]: %timeit t[[col for col in col_list if col in t.columns]]
10 loops, best of 3: 103 ms per loop
Conclusion: the list comprehension almost always wins for smaller column lists, while the Pandas / NumPy approach wins for larger ones.
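Timings like these depend on the machine and the pandas version, so it may be worth re-measuring locally. Below is a rough sketch using the standard-library timeit module outside IPython; the function names via_intersection and via_comprehension, the reduced frame size, and the repeat count are assumptions made for illustration.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
# Smaller than the frame used in the answer, to keep the run short.
df_ = pd.concat([df] * 10**4, ignore_index=True)
col_list = list('bcd')

def via_intersection(frame, wanted):
    # Select with the set intersection of the column Index and the list.
    return frame[frame.columns.intersection(wanted)]

def via_comprehension(frame, wanted):
    # Select with a plain Python filter over the requested labels.
    return frame[[col for col in wanted if col in frame.columns]]

runs = 20
for func in (via_intersection, via_comprehension):
    seconds = timeit.timeit(lambda: func(df_, col_list), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```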
Answer 1 (score: 5)
How about:
df[[col for col in list('bcd') if col in df.columns]]
This yields:
b c
0 1.0 1.0
1 1.0 1.0
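One property of the comprehension is that the result follows the order of the requested list rather than the order of df.columns. A small sketch, where the helper name existing_columns is hypothetical and not from the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))

def existing_columns(frame, wanted):
    """Return the labels from `wanted` that exist in `frame`, in `wanted` order."""
    return [col for col in wanted if col in frame.columns]

# 'c' is requested before 'b', and 'd' does not exist in df.
subset = df[existing_columns(df, ['c', 'b', 'd'])]
print(list(subset.columns))  # ['c', 'b']
```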
Answer 2 (score: 1)
Index objects support isin:
In [4]:
col_list = list('bcd')
df.loc[:, df.columns.isin(col_list)]  # .loc replaces the removed .ix indexer
Out[4]:
b c
0 1 1
1 1 1
So this builds a boolean mask over the existing columns, marking those that appear in the passed-in list.
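To make the mask explicit, here is a short sketch assuming the df and col_list from the question. It uses .loc for the selection, since .ix was removed in pandas 1.0; the variable name mask is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# One boolean per existing column: True where the label is in col_list.
mask = df.columns.isin(col_list)
print(mask.tolist())  # [False, True, True]

# A boolean mask along the column axis keeps only the True positions.
subset = df.loc[:, mask]
print(list(subset.columns))  # ['b', 'c']
```

Because the mask is built from df.columns, a label such as 'd' that is absent from the frame never causes a KeyError.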
Timings
In [5]:
df_ = pd.concat([df] * 10**5, ignore_index=True)
%timeit df_[df_.columns.intersection(col_list)]
%timeit df_[[col for col in col_list if col in df_.columns]]
%timeit df_.ix[:,df_.columns.isin(col_list)]
100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 18.6 ms per loop
10 loops, best of 3: 26.6 ms per loop
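This is the slowest of the three methods, but it takes fewer characters and may be easier to understand. Note that the timing above was taken with .ix, which is no longer available in current pandas.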