Get a DataFrame slice from a list of column names when not all of the columns are in the DataFrame

Time: 2016-10-25 17:29:21

Tags: python pandas numpy

Consider df:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df


col_list = list('bcd')

df[col_list]

raises an error:

KeyError: "['d'] not in index"

How do I get as many of the listed columns as actually exist?


3 answers:

Answer 0 (score: 5)

How about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)]
Out[69]:
     b    c
0  1.0  1.0
1  1.0  1.0

In [70]: df.columns
Out[70]: Index(['a', 'b', 'c'], dtype='object')  # <---------- Index
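
For readers who want to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. The variable names `common` and `subset` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Labels present in both df.columns and col_list; 'd' is silently dropped.
common = df.columns.intersection(col_list)

# Every label in `common` exists in df, so this lookup cannot raise the KeyError.
subset = df[common]
print(subset)
```

Note that the column order of the result is whatever intersection() returns, which may not match the order of col_list. If the order of col_list matters, the list comprehension approach is the safer choice.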

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True)

In [22]: df_.shape
Out[22]: (200000, 3)

In [23]: df.columns
Out[23]: Index(['a', 'b', 'c'], dtype='object')

In [24]: col_list = list('bcd')

In [28]: %timeit df_[df_.columns.intersection(col_list)]
100 loops, best of 3: 6.24 ms per loop

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]]
100 loops, best of 3: 5.69 ms per loop

Let's test on the transposed DF (3 rows, 200K columns):

In [30]: t = df_.T

In [31]: t.shape
Out[31]: (3, 200000)

In [32]: t
Out[32]:
   0       1       2       3       4        ...    199995  199996  199997  199998  199999
a     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0
b     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0
c     1.0     1.0     1.0     1.0     1.0   ...       1.0     1.0     1.0     1.0     1.0

[3 rows x 200000 columns]

In [33]: col_list=[-10, -20, 10, 20, 100]

In [34]: %timeit t[t.columns.intersection(col_list)]
10 loops, best of 3: 52.8 ms per loop

In [35]: %timeit t[[col for col in col_list if col in t.columns]]
10 loops, best of 3: 103 ms per loop

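Conclusion: the list comprehension almost always wins when the column index is small, whereas the Pandas / NumPy approach wins on larger ones (here, 200K columns)...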
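
The comparison can be reproduced without IPython's %timeit by using the standard library's timeit module. This is only a rough sketch: the helper names `via_intersection` and `via_comprehension` are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df_ = pd.concat([df] * 10**5, ignore_index=True)
col_list = list('bcd')

def via_intersection():
    # Index.intersection drops labels that are missing from df_.columns
    return df_[df_.columns.intersection(col_list)]

def via_comprehension():
    # Plain Python membership test against the column index
    return df_[[col for col in col_list if col in df_.columns]]

for func in (via_intersection, via_comprehension):
    # Best of 3 runs, 10 calls per run, reported per call
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{func.__name__}: {best * 1e3:.2f} ms per loop")
```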

Answer 1 (score: 5)

How about:

df[[col for col in list('bcd') if col in df.columns]]

This yields:

     b    c
0  1.0  1.0
1  1.0  1.0
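
If this pattern is needed in several places, it can be wrapped in a small helper. The function name `select_existing` is hypothetical and not part of pandas. The sketch also shows that the result follows the order of the requested list rather than the order of the DataFrame's columns.

```python
import numpy as np
import pandas as pd

def select_existing(frame, wanted):
    """Return the columns of `frame` named in `wanted`, skipping missing ones."""
    # Keeps the order of `wanted`, not the order of frame.columns.
    present = [col for col in wanted if col in frame.columns]
    return frame[present]

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
result = select_existing(df, list('cbd'))
print(list(result.columns))  # ['c', 'b']
```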

Answer 2 (score: 1)

Index objects support isin:

In [4]:    
col_list = list('bcd')
df.loc[:, df.columns.isin(col_list)]  # originally written with .ix, which was removed in pandas 1.0

Out[4]:
   b  c
0  1  1
1  1  1

So this generates a boolean mask over the existing columns, marking which of them appear in the passed-in list.
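
To make the mask visible, here is a short sketch that prints it before using it. The names `mask` and `subset` are illustrative, and `.loc` is used because `.ix` is no longer available in pandas 1.0 and later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# One boolean per column of df: True where the label appears in col_list.
mask = df.columns.isin(col_list)
print(mask)  # [False  True  True]

# .loc accepts the boolean mask on the column axis.
subset = df.loc[:, mask]
print(list(subset.columns))  # ['b', 'c']
```

Because the mask is aligned with df.columns, the selected columns come back in the DataFrame's own order, regardless of the order of col_list.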

Timings:

In [5]:
df_ = pd.concat([df] * 10**5, ignore_index=True)
%timeit df_[df_.columns.intersection(col_list)]
%timeit df_[[col for col in col_list if col in df_.columns]]
%timeit df_.ix[:,df_.columns.isin(col_list)]

100 loops, best of 3: 12.8 ms per loop
100 loops, best of 3: 18.6 ms per loop
10 loops, best of 3: 26.6 ms per loop

This is the slowest of the three methods, but it takes fewer characters and may be easier to understand.