Question

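Given a pandas DataFrame with mixed data types, df.select_dtypes is very useful for keeping only the columns you want, or dropping the ones you don't, for a particular application.

However, there seems to be no way of addressing string dtypes with this method.

From the docs (emphasis mine):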

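Raises ValueError
  If both of include and exclude are empty
  If include and exclude have overlapping elements
  If any kind of string dtype is passed in.

and

To select strings you must use the object dtype, but note that this will return all object dtype columns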

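Indeed, using df.select_dtypes(exclude=['str']) raises an error (although it is a TypeError rather than the ValueError the docs claim), and using df.select_dtypes(exclude=['object']) removes all object columns, not just the string columns.

Given a df like this: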

import pandas as pd

df = pd.DataFrame({'int_col':[0,1,2,3,4],
                   'dict_col':[dict() for i in range(5)],
                   'str_col':list('abcde')})

Here df.dtypes reports object for both str_col and dict_col.

What is the best way to exclude or include all string columns?
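The problem described in the question can be reproduced with a short sketch. One assumption that is not in the original post: the two non-integer columns are built with an explicit dtype=object, so the example behaves the same on pandas versions that would otherwise infer a dedicated string dtype for str_col.

```python
import pandas as pd

# Assumption: force the object dtype explicitly so the result does not
# depend on how the installed pandas version infers string columns.
df = pd.DataFrame({
    'int_col': [0, 1, 2, 3, 4],
    'dict_col': pd.Series([dict() for _ in range(5)], dtype=object),
    'str_col': pd.Series(list('abcde'), dtype=object),
})

# Selecting on 'object' cannot tell the dict column from the string column.
object_cols = df.select_dtypes(include=['object']).columns.tolist()

# Excluding 'object' therefore drops dict_col together with str_col.
remaining_cols = df.select_dtypes(exclude=['object']).columns.tolist()

print(object_cols)
print(remaining_cols)
```

Only int_col survives the exclusion, which is why a dtype-based filter alone is not enough here.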

Answer 1

Option 1

Using df.applymap and type, and comparing with str:

In [377]: (df.applymap(type) == str).all(0)
Out[377]: 
dict_col    False
int_col     False
str_col      True
dtype: bool

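Each element in each column is converted to its type and then compared with str. After that, just call .all(0) or .min(0) to get a per-column verdict.

Option 2

Using df.applymap and isinstance: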

In [342]: df.applymap(lambda x: isinstance(x, str)).all(0)
Out[342]: 
dict_col    False
int_col     False
str_col      True

To include these string columns, you can boolean-index on the columns:

idx = ... # one of the two methods above
df_new = df[df.columns[idx]]

Exclusion would be

df_new = df[df.columns[~idx]]
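Putting the pieces together, here is a minimal self-contained sketch of the isinstance approach. The helper name string_column_mask is hypothetical, and the sketch uses DataFrame.apply with Series.map instead of applymap, since applymap has been deprecated in newer pandas releases in favour of DataFrame.map; the per-element check is the same.

```python
import pandas as pd

def string_column_mask(frame):
    """Return a boolean Series that is True where every value in a column is a str.

    Hypothetical helper for illustration. A column containing NaN or None
    is reported as False, because those values are not str instances.
    """
    return frame.apply(lambda col: col.map(lambda x: isinstance(x, str)).all())

df = pd.DataFrame({
    'int_col': [0, 1, 2, 3, 4],
    'dict_col': [dict() for _ in range(5)],
    'str_col': list('abcde'),
})

idx = string_column_mask(df)

only_strings = df.loc[:, idx]    # keep the string columns
no_strings = df.loc[:, ~idx]     # drop the string columns

print(list(only_strings.columns))
print(list(no_strings.columns))
```

Checking every element is linear in the size of the frame, so on large data it may be worth restricting the check to the object columns first.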

Answer 2

From the pandas source:

~/.local/share/virtualenvs/.../lib/python3.7/site-packages/pandas/core/dtypes/cast.py in invalidate_string_dtypes(dtype_set)
    851     non_string_dtypes = dtype_set - {np.dtype("S").type, np.dtype("<U").type}
    852     if non_string_dtypes != dtype_set:
--> 853         raise TypeError("string dtypes are not allowed, use 'object' instead")
    854 
    855

Apparently, the best practice is to use object instead. So, just use

df.select_dtypes(exclude=['object'])

assuming these columns hold only values of type str and no other Python objects.
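When that assumption does not hold, one possible refinement is to narrow to the object columns first and then keep only those whose values are all strings. The sketch below uses pandas.api.types.infer_dtype for that second step; this is not part of the answer above, and the columns are built with an explicit dtype=object (an assumption) so that select_dtypes behaves the same across pandas versions.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Assumption: explicit object dtype, so the example does not depend on
# version-specific string dtype inference.
df = pd.DataFrame({
    'int_col': [0, 1, 2, 3, 4],
    'dict_col': pd.Series([dict() for _ in range(5)], dtype=object),
    'str_col': pd.Series(list('abcde'), dtype=object),
})

# Step 1: cheap dtype filter, as suggested in the answer.
object_cols = df.select_dtypes(include=['object']).columns

# Step 2: among the object columns, keep those inferred as pure strings.
str_cols = [c for c in object_cols if infer_dtype(df[c], skipna=True) == 'string']

without_strings = df.drop(columns=str_cols)

print(str_cols)
print(list(without_strings.columns))
```

Only the object columns are inspected element by element, so numeric columns are skipped entirely.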

Answer 3

You can use the function hasattr to check whether a column has the str attribute:

str_cols = [col for col in df.columns if hasattr(df[col], 'str')]
df[str_cols]

Select string columns in a pandas df (equivalent to df.select_dtypes)
