Question

有时，当数据导入Pandas Dataframe时，它始终导入为object类型。这很好，适用于大多数操作，但我正在尝试创建自定义导出功能，我的问题是：

有没有办法强制Pandas推断输入数据的数据类型？
如果没有，加载数据后是否有办法以某种方式推断数据类型？

我知道我可以告诉Pandas这是int，str等类型..但是我不想这样做，我希望pandas可以足够聪明地知道用户时的所有数据类型导入或添加一列。

编辑 - 导入示例

a = ['a']
col = ['somename']
df = pd.DataFrame(a, columns=col)
print(df.dtypes)
>>> somename    object
dtype: object

类型应该是字符串？

Answer 1

这只是部分答案，但您可以在整个DataFrame中获取变量中元素的数据类型的频率计数，如下所示：

dtypeCount =[df.iloc[:,i].apply(type).value_counts() for i in range(df.shape[1])]

返回

dtypeCount

[<class 'numpy.int32'>    4
 Name: a, dtype: int64,
 <class 'int'>    2
 <class 'str'>    2
 Name: b, dtype: int64,
 <class 'numpy.int32'>    4
 Name: c, dtype: int64]

它不能很好地打印，但您可以按位置提取任何变量的信息：

dtypeCount[1]

<class 'int'>    2
<class 'str'>    2
Name: b, dtype: int64

应该让您开始找到导致问题的数据类型以及有多少数据类型。

然后，您可以使用

检查第二个变量中具有str对象的行

df[df.iloc[:,1].map(lambda x: type(x) == str)]

   a  b  c
1  1  n  4
3  3  g  6

数据

df = DataFrame({'a': range(4), 'b': [6, 'n', 7, 'g'], 'c': range(3, 7)})

Answer 2

您还可以通过使用infer_objects()删除不相关的项目来推断对象。下面是一个一般示例。

df_orig = pd.DataFrame({"A": ["a", 1, 2, 3], "B": ["b", 1.2, 1.8, 1.8]})
df = df_orig.iloc[1:].infer_objects()
print(df_orig.dtypes, df.dtypes, sep='\n\n')

输出：

Answer 3

在这里（不是很完美）尝试写一个更好的推断器。当您的数据帧中有所有就绪数据时，推断者将猜测可能的smallet类型。当前缺少日期时间，但我认为这可能是一个起点。有了这个推断者，我可以减少70％的使用内存。

def infer_df(df, hard_mode=False, float_to_int=False, mf=None):
    ret = {}

    # ToDo: How much does auto convertion cost
    # set multiplication factor
    mf = 1 if hard_mode else 0.5

    # set supported datatyp
    integers = ['int8', 'int16', 'int32', 'int64']
    floats = ['float16', 'float32', 'float64']

    # ToDo: Unsigned Integer

    # generate borders for each datatype
    b_integers = [(np.iinfo(i).min, np.iinfo(i).max, i) for i in integers]
    b_floats = [(np.finfo(f).min, np.finfo(f).max, f) for f in floats]

    for c in df.columns:
        _type = df[c].dtype

        # if a column is set to float, but could be int
        if float_to_int and np.issubdtype(_type, np.floating):
            if np.sum(np.remainder(df[c], 1)) == 0:
                df[c] = df[c].astype('int64')
                _type = df[c].dtype

        # convert type of column to smallest possible
        if np.issubdtype(_type, np.integer) or np.issubdtype(_type, np.floating):
            borders = b_integers if np.issubdtype(_type, np.integer) else b_floats

            _min = df[c].min()
            _max = df[c].max()

            for b in borders:
                if b[0] * mf < _min and _max < b[1] * mf:
                    ret[c] = b[2]
                    break

        if _type == 'object' and len(df[c].unique()) / len(df) < 0.1:
            ret[c] = 'category'

    return ret

确定Pandas Column DataType

3 个答案: