Question

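I am working on an algorithm that uses the pandas library, and I ran into an interesting problem.

When I write a DataFrame to a file and read it back, the DataFrame changes. When I looked into the cause, I found that it comes down to the dtypes. For example, I create a DataFrame like this: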

import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(d)
df.col1 = df.col1.astype('int8')

df.info()

The output looks like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
col1    2 non-null int8
col2    2 non-null int64
dtypes: int64(1), int8(1)
memory usage: 98.0 bytes

That is only 98 bytes.

I then write it to a file and read it back.

df.to_csv('test.csv', index=False)
pd.read_csv('test.csv').info()

The output looks like this:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
col1    2 non-null int64
col2    2 non-null int64
dtypes: int64(2)
memory usage: 112.0 bytes

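Now the memory usage is 112 bytes. The problem is that when the CSV file is read, col1 comes back as int64. I am doing this on large DataFrames, and my data goes from 250 MB to 1.14 GB.

My question: is there a way to automatically convert the columns of a DataFrame to the smallest possible dtypes? I tried the infer_dtypes function, but did not get the result I wanted. It tells me that a column is an integer, but not which integer type it should be.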

Answer 1

to_numeric has a downcast parameter, so you can downcast numeric columns like this:

df.col1 = pd.to_numeric(df.col1, downcast='integer')

Example:

import io
import pandas as pd
s = """col1,col2,col3
1,1000000,'a'
"""
df = pd.read_csv(io.StringIO(s))

df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 1 entries, 0 to 0
#Data columns (total 3 columns):
#col1    1 non-null int64
#col2    1 non-null int64
#col3    1 non-null object
#dtypes: int64(2), object(1)
#memory usage: 84.0+ bytes

num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].apply(lambda x: pd.to_numeric(x, downcast='integer'))

df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 1 entries, 0 to 0
#Data columns (total 3 columns):
#col1    1 non-null int8
#col2    1 non-null int32
#col3    1 non-null object
#dtypes: int32(1), int8(1), object(1)
#memory usage: 73.0+ bytes
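
The snippet above selects every numeric column. A slightly narrower variant, sketched here, restricts the downcast to integer columns so float columns are left alone, and checks that no value changed. The helper name `downcast_integer_columns` is made up for illustration and is not part of pandas.

```python
import io

import numpy as np
import pandas as pd


def downcast_integer_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with each integer column downcast (hypothetical helper)."""
    out = df.copy()
    # Only integer columns are touched; floats and strings keep their dtype.
    int_cols = out.select_dtypes(include=[np.integer]).columns
    for col in int_cols:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


raw = pd.read_csv(io.StringIO("col1,col2,col3\n1,1000000,a\n"))
small = downcast_integer_columns(raw)

# The values must be identical; only the storage type differs.
assert (small[["col1", "col2"]] == raw[["col1", "col2"]]).all().all()
print(small.dtypes)
```

Working on a copy keeps the original frame available for the comparison; on very large frames you may prefer to assign the columns in place instead.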

Answer 2

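After some research, I found that the to_numeric function works fine. I had also written my own implementation, shown below.

First, I created a DataFrame from the numpy integer types.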

import numpy as np
import pandas as pd

np_types = [np.int8, np.int16, np.int32, np.int64,
            np.uint8, np.uint16, np.uint32, np.uint64]
np_types = [np_type.__name__ for np_type in np_types]
type_df = pd.DataFrame(data=np_types, columns=['class_type'])
type_df

Then I added each type's value range to the DataFrame:

type_df['min_value'] = type_df['class_type'].apply(lambda row: np.iinfo(row).min)
type_df['max_value'] = type_df['class_type'].apply(lambda row: np.iinfo(row).max)
type_df['range'] = type_df['max_value'] - type_df['min_value']
type_df.sort_values(by='range', inplace=True)
type_df

Then I wrote a function that, for each integer column, uses the column's minimum and maximum values to find the most suitable type.

def optimize_types(dataframe):
    # Only integer columns are candidates for an integer downcast.
    for col in dataframe.select_dtypes(include=[np.integer]).columns:
        col_min = dataframe[col].min()
        col_max = dataframe[col].max()
        # Keep the candidate types whose range covers [col_min, col_max] ...
        temp = type_df[(type_df['min_value'] <= col_min) & (type_df['max_value'] >= col_max)]
        # ... and pick the one with the narrowest range.
        optimized_class = temp.loc[temp['range'].idxmin(), 'class_type']
        print("Col name : {} Col min_value : {} Col max_value : {} Optimized Class : {}".format(col, col_min, col_max, optimized_class))
        dataframe[col] = dataframe[col].astype(optimized_class)
    return dataframe
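
The same min/max lookup can also be sketched without the helper DataFrame, by walking a list of candidate types from smallest to largest. The names `smallest_int_dtype` and `shrink_integer_columns` are hypothetical, and the choice to prefer the unsigned type when both fit is an assumption of this sketch, not something the approach above prescribes.

```python
import numpy as np
import pandas as pd

# Candidate types ordered by size; the unsigned type comes first within each size.
_INT_TYPES = [np.uint8, np.int8, np.uint16, np.int16,
              np.uint32, np.int32, np.uint64, np.int64]


def smallest_int_dtype(col_min: int, col_max: int) -> np.dtype:
    """Return the first candidate whose range covers [col_min, col_max]."""
    for np_type in _INT_TYPES:
        info = np.iinfo(np_type)
        if info.min <= col_min and col_max <= info.max:
            return np.dtype(np_type)
    raise ValueError("no fixed-width integer type covers this range")


def shrink_integer_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with each integer column cast to its smallest fitting type."""
    out = df.copy()
    if len(out) == 0:
        return out  # min()/max() are undefined on an empty frame
    for col in out.select_dtypes(include=[np.integer]).columns:
        # Convert to plain Python ints so the range comparison cannot overflow.
        target = smallest_int_dtype(int(out[col].min()), int(out[col].max()))
        out[col] = out[col].astype(target)
    return out


demo = shrink_integer_columns(
    pd.DataFrame({"a": [0, 200], "b": [-5, 100], "c": [1.5, 2.5]})
)
print(demo.dtypes)
```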

My DataFrame was 2.6 GB. With the function above, it shrank to 600 MB.

When I used the to_numeric function, I got the following result:

Answer 3

This works for all numeric types and helps get rid of np.int64 and np.float64:

import numbers
import pandas as pd
from typing import Optional

def auto_opt_pd_dtypes(df_: pd.DataFrame, inplace=False) -> Optional[pd.DataFrame]:
    """ Automatically downcast Number dtypes for minimal possible,
        will not touch other (datetime, str, object, etc)
        
        :param df_: dataframe
        :param inplace: if False, will return a copy of input dataset
        
        :return: `None` if `inplace=True` or dataframe if `inplace=False`
    """
    df = df_ if inplace else df_.copy()
        
    for col in df.columns:
        # integers
        if issubclass(df[col].dtypes.type, numbers.Integral):
            # unsigned integers
            if df[col].min() >= 0:
                df[col] = pd.to_numeric(df[col], downcast='unsigned')
            # signed integers
            else:
                df[col] = pd.to_numeric(df[col], downcast='integer')
        # other real numbers
        elif issubclass(df[col].dtypes.type, numbers.Real):
            df[col] = pd.to_numeric(df[col], downcast='float')
    
    if not inplace:
        return df

Usage:

# return optimized copy
df_opt = auto_opt_pd_dtypes(df)
# or optimize in place
auto_opt_pd_dtypes(df, inplace=True)
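
One thing to keep in mind with float columns: float32 holds roughly seven significant decimal digits, so moving from float64 to float32 can change stored values slightly. The sketch below shows one way to accept the smaller type only when the values survive within a tolerance you choose. `downcast_float_checked` and its `rtol`/`atol` parameters are hypothetical names for this illustration, not a pandas API.

```python
import numpy as np
import pandas as pd


def downcast_float_checked(series: pd.Series, rtol: float = 0.0,
                           atol: float = 0.0) -> pd.Series:
    """Cast to float32 only if every value stays within the given tolerance.

    With the default rtol=0 and atol=0 the cast is accepted only when it is exact.
    """
    candidate = series.astype(np.float32)
    unchanged = np.allclose(
        candidate.to_numpy(dtype=np.float64),
        series.to_numpy(dtype=np.float64),
        rtol=rtol, atol=atol, equal_nan=True,
    )
    return candidate if unchanged else series


exact = downcast_float_checked(pd.Series([0.5, 1.25]))    # exactly representable
kept = downcast_float_checked(pd.Series([0.1]))           # not exact in float32
loose = downcast_float_checked(pd.Series([0.1]), rtol=1e-6)
print(exact.dtype, kept.dtype, loose.dtype)
```

Whether a small rounding difference matters depends on the data; identifiers or monetary amounts stored as floats are usually worth keeping at full precision.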

Answer 4

If all the columns are numeric, you can do the following:

import numpy as np
df = df.astype(np.int8)

If not all the columns are numeric, you can first slice the DataFrame to select the numeric columns and then call astype.
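
A fixed target such as int8 only holds values from -128 to 127, so it is worth checking the column ranges before casting. The following is a minimal sketch of such a guard; `astype_int_checked` is a hypothetical helper, and raising a `ValueError` on out-of-range data is just one possible policy.

```python
import numpy as np
import pandas as pd


def astype_int_checked(df: pd.DataFrame, dtype=np.int8) -> pd.DataFrame:
    """Cast the integer columns of df to dtype, refusing values that do not fit."""
    info = np.iinfo(dtype)
    out = df.copy()
    if len(out) == 0:
        return out  # nothing to check on an empty frame
    for col in out.select_dtypes(include=[np.integer]).columns:
        lo, hi = int(out[col].min()), int(out[col].max())
        if lo < info.min or hi > info.max:
            raise ValueError(
                f"column {col!r} has range [{lo}, {hi}], "
                f"outside [{info.min}, {info.max}]"
            )
        out[col] = out[col].astype(dtype)
    return out


ok = astype_int_checked(pd.DataFrame({"a": [1, 2], "s": ["x", "y"]}))
print(ok.dtypes)

try:
    astype_int_checked(pd.DataFrame({"a": [1, 300]}))
except ValueError as exc:
    print("refused:", exc)
```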

Answer 5

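One option is to use a file format that can serialize Python objects, so the dtypes are preserved. Here I use pickle. For large DataFrames, this may also give a significant performance improvement for IO operations.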

import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(d)
df['col1'] = df.col1.astype('int8')

df.info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 2 columns):
#col1    2 non-null int8
#col2    2 non-null int64
#dtypes: int64(1), int8(1)
#memory usage: 146.0 bytes

df.to_pickle('test.pkl')
pd.read_pickle('test.pkl').info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 2 columns):
#col1    2 non-null int8
#col2    2 non-null int64
#dtypes: int64(1), int8(1)
#memory usage: 146.0 bytes

Another option is to stick with CSV but keep a schema of the form {'col_name': 'dtype'}, which you can pass in whenever you read the file.

schema = {'col1': 'int8'}
df.to_csv('test.csv', index=False)
pd.read_csv('test.csv', dtype=schema).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 2 entries, 0 to 1
#Data columns (total 2 columns):
#col1    2 non-null int8
#col2    2 non-null int64
#dtypes: int64(1), int8(1)
#memory usage: 146.0 bytes
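
If you do not want to maintain the schema by hand, one possibility is to derive it from `df.dtypes` and store it next to the CSV. The sketch below writes it to a JSON sidecar file. The helper names and the `.dtypes.json` suffix are assumptions made for this example, and it assumes string column names and dtypes that `read_csv` accepts by name (datetime columns, for instance, would need `parse_dates` instead).

```python
import json
import os
import tempfile

import pandas as pd


def to_csv_with_schema(df: pd.DataFrame, csv_path: str) -> None:
    """Write df to csv_path and its dtypes to a JSON file alongside it."""
    df.to_csv(csv_path, index=False)
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    with open(csv_path + ".dtypes.json", "w") as fh:
        json.dump(schema, fh)


def read_csv_with_schema(csv_path: str) -> pd.DataFrame:
    """Read csv_path using the dtypes stored by to_csv_with_schema."""
    with open(csv_path + ".dtypes.json") as fh:
        schema = json.load(fh)
    return pd.read_csv(csv_path, dtype=schema)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.csv")
    df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}).astype({"col1": "int8"})
    to_csv_with_schema(df, path)
    restored = read_csv_with_schema(path)

print(restored.dtypes)
```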

Automatically optimizing pandas dtypes
