Efficiently converting large Pandas DataFrame columns from float to int

Time: 2019-10-21 16:52:53

Tags: python pandas dataframe

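This is different from the question "error using astype when NaN exists in a dataframe" because I need to keep the NaN values, so I've chosen to use the experimental IntegerArray type. The crux of this problem is trying to avoid looping.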

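We have a number of large medical datasets that I'm importing into Pandas from SAS. Most of the fields are enumerated types that should be represented as integers, but they come in as float64 because many of them contain NaN values. The experimental IntegerArray type in Pandas solves the NaN problem. However, these datasets are very large, and I'd like to convert them in a script based on the data itself. The following script works, but it is extremely slow, and I'm looking for a more Pythonic or "Pandorable" way of writing it.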

import numpy as np
import pandas as pd

# Convert any non-float fields to IntegerArray (Int)
# Note that IntegerArrays are an experimental addition in Pandas 0.24. They
# allow integer columns to contain NaN fields like float columns.
#
# This is a rather brute-force technique that loops through every column
# and every row. There's got to be a more efficient way to do it since it
# takes a long time and uses up a lot of memory.
def convert_integer(df):
    for col in df.columns:
        intcol_flag = True
        if df[col].dtype == 'float64':   # Assuming dtype is "float64"
            # TODO: Need to remove inner loop - SLOW!
            for val in df[col]:
                # If not NaN and the int() value is different from
                # the float value, then we have an actual float.
                if pd.notnull(val) and abs(val - int(val)) > 1e-6:
                    intcol_flag = False
                    break
            # If not a float, change it to an Int based on size
            if intcol_flag:
                if df[col].abs().max() < 127:
                    df[col] = df[col].astype('Int8')
                elif df[col].abs().max() < 32767:
                    df[col] = df[col].astype('Int16')
                else:   # assuming no ints greater than 2147483647 
                    df[col] = df[col].astype('Int32') 
        print(f"{col} is {df[col].dtype}")
    return df

I figured the inner for loop was the problem, so I tried replacing it with:

            s = df[col].apply(lambda x: pd.notnull(x) and abs(x - int(x)) > 1e-6)
            if s.any():
                intcol_flag = False

It is still just as slow.

Here is some sample data and the desired output:

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.nan], (3,9)),
                  columns=[f'col{i}' for i in range(9)])
df

    col0    col1    col2    col3    col4    col5    col6    col7    col8
0   2.0     NaN   111111.0  1.0     2.0   5000.0  111111.0  2.0     NaN
1   1.0     NaN     2.0     3.3     1.0      2.0     1.0    3.3     1.0
2  111111.0 5000.0  1.0   111111.0  5000.0   1.0    5000.0  3.3     2.0

The result should be:

col0 is Int32
col1 is Int16
col2 is Int32
col3 is float64
col4 is Int16
col5 is Int16
col6 is Int32
col7 is float64
col8 is Int8

1 Answer:

Answer 0 (score: 2)

Find the columns that need to be cast to each type, then do the cast for each type all at once.

Sample Data

import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.nan], (3,9)),
                  columns=[f'col{i}' for i in range(9)])

Code

s = pd.cut(df.max(), bins=[0, 127, 32767, 2147483647], labels=['Int8', 'Int16', 'Int32'])
s = s.where((df.dtypes=='float') & (df.isnull() | (df%1 == 0)).all())
            # Cast previously       # If all values are 
            # float columns         # "I"nteger-like

for idx, gp in s.groupby(s):
    # Assign whole columns so the nullable integer dtype replaces float64
    df[gp.index] = df[gp.index].astype(idx)

df.dtypes
#col0      Int32
#col1      Int16
#col2      Int32
#col3    float64
#col4      Int16
#col5      Int16
#col6      Int32
#col7    float64
#col8       Int8
#dtype: object

print(df)
#     col0  col1    col2      col3  col4  col5    col6  col7  col8
#0       2   NaN  111111       1.0     2  5000  111111   2.0   NaN
#1       1   NaN       2       3.3     1     2       1   3.3     1
#2  111111  5000       1  111111.0  5000     1    5000   3.3     2
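
The same idea can also be wrapped in a reusable function. The following is only a sketch, not part of the original answer: the names `downcast_integer_like` and `mapping` are made up for illustration, and it assumes that an exact `% 1 == 0` test is acceptable (the question used a `1e-6` tolerance instead). Unlike the `pd.cut` version, it sizes each column by its largest absolute value, so negative codes are handled too, and it hands one column-to-dtype dictionary to `DataFrame.astype` rather than assigning through `.loc`.

```python
import numpy as np
import pandas as pd


def downcast_integer_like(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with integer-like float64 columns cast to Int8/16/32/64.

    Hypothetical helper for illustration; only float64 columns are considered.
    """
    floats = df.select_dtypes(include="float64")
    # A column qualifies when every value is either missing or a whole number.
    integer_like = (floats.isna() | (floats % 1 == 0)).all()
    candidates = floats.loc[:, integer_like]

    # The largest magnitude per column decides the narrowest dtype that fits.
    mapping = {}
    for col, peak in candidates.abs().max().items():
        if pd.isna(peak):
            continue  # column is entirely NaN, leave it as float64
        if peak <= 127:
            mapping[col] = "Int8"
        elif peak <= 32767:
            mapping[col] = "Int16"
        elif peak <= 2147483647:
            mapping[col] = "Int32"
        else:
            mapping[col] = "Int64"

    if not mapping:
        return df.copy()
    # One astype call converts every selected column; the rest are untouched.
    return df.astype(mapping)


example = pd.DataFrame({
    "a": [1.0, np.nan, 2.0],       # small whole numbers with a gap
    "b": [5000.0, 1.0, np.nan],    # needs 16 bits
    "c": [111111.0, 2.0, 1.0],     # needs 32 bits
    "d": [3.3, 1.0, np.nan],       # a real float, must stay float64
})
out = downcast_integer_like(example)
print({col: str(dtype) for col, dtype in out.dtypes.items()})
# {'a': 'Int8', 'b': 'Int16', 'c': 'Int32', 'd': 'float64'}
```

The only Python-level loop left runs once per column, not once per row, so the per-value work stays inside vectorized Pandas operations.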