Question

I need to use different functions to handle numeric columns and string columns. What I am doing now is really dumb:

allc = list((agg.loc[:, (agg.dtypes==np.float64)|(agg.dtypes==np.int)]).columns)
for y in allc:
    treat_numeric(agg[y])    

allc = list((agg.loc[:, (agg.dtypes!=np.float64)&(agg.dtypes!=np.int)]).columns)
for y in allc:
    treat_str(agg[y])

Is there a more elegant way to do this? E.g.

for y in agg.columns:
    if(dtype(agg[y]) == 'string'):
          treat_str(agg[y])
    elif(dtype(agg[y]) != 'string'):
          treat_numeric(agg[y])

Answer 1

You can access the data type of a column with dtype:

for y in agg.columns:
    if(agg[y].dtype == np.float64 or agg[y].dtype == np.int64):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])

Answer 2

In pandas 0.20.2 you can do:

from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

is_string_dtype(df['A'])
>>>> True

is_numeric_dtype(df['B'])
>>>> True

So your code becomes:

for y in agg.columns:
    if (is_string_dtype(agg[y])):
        treat_str(agg[y])
    elif (is_numeric_dtype(agg[y])):
        treat_numeric(agg[y])
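As a minimal sketch of how the loop above behaves on a small frame: the DataFrame contents and the bodies of treat_str and treat_numeric below are made up for illustration. The datetime column is there to show that a column can match neither predicate, so an explicit fallback branch is worth having.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def treat_numeric(col):
    # Hypothetical numeric handler: just return the mean.
    return col.mean()


def treat_str(col):
    # Hypothetical string handler: upper-case every value.
    return col.str.upper().tolist()


# Illustrative data, not taken from the question.
agg = pd.DataFrame({
    "name": ["a", "b"],
    "qty": [1, 2],
    "price": [1.5, 2.5],
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})

results = {}
for y in agg.columns:
    if is_numeric_dtype(agg[y]):
        results[y] = ("numeric", treat_numeric(agg[y]))
    elif is_string_dtype(agg[y]):
        results[y] = ("string", treat_str(agg[y]))
    else:
        # Datetime columns are neither numeric nor string.
        results[y] = ("skipped", None)

print(results)
```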

Answer 3

I know this is an old thread, but with pandas 0.19.2 you can do:

df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html
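A small sketch of the same idea, assuming the generic 'number' selector is acceptable in place of listing float64 explicitly. The frame and the two lambdas are placeholders for your own data and functions.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5], "c": ["x", "y"]})

# Split the frame once by dtype, then apply a function per column.
numeric_part = df.select_dtypes(include="number")
other_part = df.select_dtypes(exclude="number")

# Placeholder functions standing in for your_function / your_other_function.
numeric_result = numeric_part.apply(lambda col: col.max() - col.min())
other_result = other_part.apply(lambda col: col.str.upper())

print(numeric_result)
print(other_result)
```

Using include/exclude with the same selector means every column ends up in exactly one of the two parts.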

Answer 4

If you want to identify the type of a dataframe column as a string code, you can do:

df['A'].dtype.kind

An example:

In [8]: df = pd.DataFrame([[1,'a',1.2],[2,'b',2.3]])
In [9]: df[0].dtype.kind, df[1].dtype.kind, df[2].dtype.kind
Out[9]: ('i', 'O', 'f')

The answer for your code:

for y in agg.columns:
    if(agg[y].dtype.kind == 'f' or agg[y].dtype.kind == 'i'):
          treat_numeric(agg[y])
    else:
          treat_str(agg[y])
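A slightly more compact variant of the same check, sketched under the assumption that unsigned integers should count as numeric too: test the kind code for membership in a string of accepted codes ('i' signed integer, 'u' unsigned integer, 'f' float). The frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({
    "i": [1, 2],
    "u": np.array([1, 2], dtype="uint8"),
    "f": [1.2, 2.3],
    "s": ["a", "b"],
})

# Kind codes treated as numeric in this sketch.
NUMERIC_KINDS = "iuf"

numeric_cols = [c for c in df.columns if df[c].dtype.kind in NUMERIC_KINDS]
other_cols = [c for c in df.columns if c not in numeric_cols]

print(numeric_cols, other_cols)
```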

Answer 5

To pretty-print the column data types

For example, to check the data types after importing from a file:

def printColumnInfo(df):
    template="%-8s %-30s %s"
    print(template % ("Type", "Column Name", "Example Value"))
    print("-"*53)
    for c in df.columns:
        print(template % (df[c].dtype, c, df[c].iloc[1]) )

Illustrative output:

Type     Column Name                    Example Value
-----------------------------------------------------
int64    Age                            49
object   Attrition                      No
object   BusinessTravel                 Travel_Frequently
float64  DailyRate                      279.0

Answer 6

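The title of the question is general, but the author's use case stated in the body of the question is specific. So any of the other answers may be used.

But to fully answer the title question, it should be clarified that all of the approaches may fail in some cases and require some rework. I reviewed all of them (and some additional ones), ordered from least to most reliable (in my opinion):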

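1. Comparing types directly via `==` (accepted answer).

Although this is the accepted answer and has the most upvotes, I think this method should not be used at all, because comparing types this way is discouraged in Python, as has been mentioned several times.
But if you still want to use it, be aware of some pandas-specific dtypes such as pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here an extra type( ) is needed to recognize the dtype correctly: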

s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype   # Not working
type(s.dtype) == pd.PeriodDtype # working 

>>> 0    2002-03-01
>>> 1    2012-02-01
>>> dtype: period[D]
>>> False
>>> True

Another caveat here is that the type has to be specified precisely:

s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working

>>> 0    1
>>> 1    2
>>> dtype: int64
>>> True
>>> False
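If exact equality is too strict, one possible workaround (a sketch, not something the accepted answer proposes) is to test for a whole family of dtypes. For plain numpy dtypes, np.issubdtype checks against an abstract type such as np.integer, and for pandas-specific dtypes isinstance on the dtype object replaces the type( ) comparison. Note that np.issubdtype is meant for numpy dtypes, so it is only applied to numpy-backed columns here.

```python
import numpy as np
import pandas as pd

s64 = pd.Series([1, 2], dtype="int64")
s32 = pd.Series([1, 2], dtype="int32")

# Exact comparison only matches the one dtype that was named.
exact_match = [s.dtype == np.int64 for s in (s64, s32)]

# Family check: any signed or unsigned numpy integer dtype matches.
family_match = [bool(np.issubdtype(s.dtype, np.integer)) for s in (s64, s32)]

# For a pandas extension dtype, test the dtype object with isinstance.
periods = pd.Series([pd.Period("2002-03", "D"), pd.Period("2012-02-01", "D")])
is_period = isinstance(periods.dtype, pd.PeriodDtype)

print(exact_match, family_match, is_period)
```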

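2. `isinstance()` approach.

This method has not been mentioned in the answers so far.

So if direct comparison of types is not a good idea, let's try the built-in Python function made for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects to inspect, but a pd.Series or pd.DataFrame may be used as just an empty container with a predefined dtype and no objects in it: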

s = pd.Series([], dtype=bool)
s

>>> Series([], dtype: bool)

But if one somehow overcomes this issue and wants to access each object, for example in the first row, and check its dtype like this:

df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    print((df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)))

>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

It becomes misleading when a single column holds mixed types of data:

df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                  index = ['A', 'B'])
for col in df2.columns:
    print((df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)))

>>> (dtype('O'), 'is_int64 = False')

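And last but not least, this method cannot directly recognize the Category dtype. As stated in the docs:

Returning a single item from categorical data will also return the value, not a categorical of length "1".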

df['int'] = df['int'].astype('category')
for col in df.columns:
    print((df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)))

>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')

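So this method is also almost inapplicable.

3. `df.dtype.kind` approach.

This method may still work with an empty pd.Series or pd.DataFrame, but it has other problems.

First, it is unable to distinguish some dtypes: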

df = pd.DataFrame({'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str'  :['s1', 's2'],
                   'cat'  :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)

>>> period[D] O
>>> object O
>>> category O

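Second, and this is actually still unclear to me, it even returns None for some dtypes.

4. `df.select_dtypes` approach.

This is almost what we want. This method is designed inside pandas, so it handles most of the corner cases mentioned earlier: empty DataFrames, and numpy-specific versus pandas-specific dtypes. It works well with a single dtype like .select_dtypes('bool'). It can even be used to select groups of columns based on dtype: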

test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt'   :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td'   :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                              pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd'  :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str'  :['s1', 's2'],
                     'cat'  :[1, -1],
                     'obj'  :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')

Like so, as stated in the docs:

test.select_dtypes('number')

>>>     int64   int32   float   compl   td
>>> 0      -1      -1   -2.5    (1-1j)  -1693 days
>>> 1       2       2    3.4    (5+0j)   3531 days

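One might think this is the first unexpected result (it used to be a question for me): TimeDelta is included in the output DataFrame. But as was answered, this is the intended behavior; one just has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for some, but this is because bool and number are in different "subtrees" of the numpy dtype hierarchy. For bool, we can use test.select_dtypes(['bool']) here.

The next restriction of this method is that, for the current version of pandas (0.24.2), this code: test.select_dtypes('period') will raise NotImplementedError.

Another thing is that it cannot distinguish strings from other objects: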

test.select_dtypes('object')

>>>     str     obj
>>> 0    s1     [1, 2, 3]
>>> 1    s2     [5435, 35, -52, 14]

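But this is, first, already mentioned in the docs. And second, it is not a problem of this method but rather of the way strings are stored in a DataFrame. Either way, this case requires some post-processing.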
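One possible form of that post-processing, sketched here as an assumption rather than as the author's actual pipeline, is to let pandas.api.types.infer_dtype inspect the values of each column and keep only the columns it reports as 'string'. The small frame below reuses the 'str' and 'obj' columns from the example, plus a hypothetical 'num' column.

```python
import pandas as pd
from pandas.api.types import infer_dtype

frame = pd.DataFrame({
    "str": ["s1", "s2"],
    "obj": [[1, 2, 3], [5435, 35, -52, 14]],
    "num": [1, 2],  # hypothetical extra column for contrast
})

# infer_dtype looks at the values, not only at the column dtype,
# so a column of Python lists is not reported as 'string'.
string_cols = [
    c for c in frame.columns
    if infer_dtype(frame[c], skipna=True) == "string"
]

print(string_cols)
```

Because infer_dtype scans the values, it is slower than a pure dtype check on large columns, so it is probably best reserved for columns that a cheaper check could not classify.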

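5. `pd.api.types.is_XXX_dtype` approach.

I suppose this is intended to be the most robust and native way to recognize a dtype (the path of the module where the functions reside says so by itself). It works almost perfectly, but it still has caveats.

In addition, this may be subjective, but compared with .select_dtypes('number') this approach also handles the number dtypes group in a more "human-understandable" way: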

for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)

>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128

No timedelta, and bool is included. Perfect.
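Since is_numeric_dtype reports bool columns as numeric (bool appears in the printed list above), a pipeline that wants to treat flags separately has to test for bool first. A minimal sketch, with a hypothetical classify helper and made-up column names:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

frame = pd.DataFrame({
    "flag": [False, True],
    "count": [-1, 2],
    "ratio": [-2.5, 3.4],
    "delta": pd.to_timedelta(["1 days", "2 days"]),
    "label": ["s1", "s2"],
})


def classify(col):
    # Hypothetical helper: check bool before numeric, because
    # is_numeric_dtype is also True for bool columns.
    if is_bool_dtype(col):
        return "bool"
    if is_numeric_dtype(col):
        return "numeric"
    return "other"


groups = {name: classify(frame[name]) for name in frame.columns}
print(groups)
```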

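My pipeline currently relies on exactly this functionality, plus a bit of post-processing.

Conclusion.

I hope I was able to argue the main point: all of the discussed approaches can be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should really be considered applicable.

P.S.: I hope I did not make too many mistakes in all these tests :)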

How to check the dtype of a column in python pandas
