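This is different from "error using astype when NaN exists in a dataframe" because I need to keep the NaN values, so I have opted to use the experimental IntegerArray. The crux of this question is trying to avoid looping.
We have a number of large medical datasets that I am importing from SAS into Pandas. Most of the fields are enumerated types and should be represented as integers, but because many of them contain NaN values they come in as float64. The experimental IntegerArray type in Pandas solves the NaN problem. However, these datasets are very large, and I would like to convert them in a script based on the data itself. The following script works, but it is extremely slow, and I am looking for a more Pythonic or "Pandorable" way of writing it.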
# Convert any non-float fields to IntegerArray (Int)
# Note that IntegerArrays are an experimental addition in Pandas 0.24. They
# allow integer columns to contain NaN fields like float columns.
#
# This is a rather brute-force technique that loops through every column
# and every row. There's got to be a more efficient way to do it since it
# takes a long time and uses up a lot of memory.
def convert_integer(df):
    for col in df.columns:
        intcol_flag = True
        if df[col].dtype == 'float64':     # Assuming dtype is "float64"
            # TODO: Need to remove inner loop - SLOW!
            for val in df[col]:
                # If not NaN and the int() value is different from
                # the float value, then we have an actual float.
                if pd.notnull(val) and abs(val - int(val)) > 1e-6:
                    intcol_flag = False
                    break
            # If not a float, change it to an Int based on size
            if intcol_flag:
                if df[col].abs().max() < 127:
                    df[col] = df[col].astype('Int8')
                elif df[col].abs().max() < 32767:
                    df[col] = df[col].astype('Int16')
                else:   # assuming no ints greater than 2147483647
                    df[col] = df[col].astype('Int32')
        print(f"{col} is {df[col].dtype}")
    return df
I thought the inner for loop was the problem, but I tried replacing it with:
s = df[col].apply(lambda x: pd.notnull(x) and abs(x - int(x)) > 1e-6)
if s.any():
    intcol_flag = False
and it is still just as slow.
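A likely reason the `apply` version is no faster is that `Series.apply` with a Python lambda still calls that lambda once per element, so the per-row work stays in the interpreter. The following is a minimal sketch of the same check done on the whole column at once with NumPy; the helper name `is_integer_like` and its `tol` parameter are hypothetical and do not come from the original post.

```python
import numpy as np
import pandas as pd


def is_integer_like(col, tol=1e-6):
    """True if every non-NaN value in a float Series is within tol of a whole number."""
    vals = col.to_numpy(dtype="float64")
    # np.trunc rounds toward zero, like int() in the loop version; NaN stays NaN.
    frac = np.abs(vals - np.trunc(vals))
    # NaN > tol evaluates to False, so missing values never count as fractional.
    return not bool((frac > tol).any())


print(is_integer_like(pd.Series([1.0, np.nan, 5000.0])))  # True
print(is_integer_like(pd.Series([1.0, 3.3, np.nan])))     # False
```

The result of `is_integer_like(df[col])` could stand in for `intcol_flag` in the loop over columns, which removes the loop over rows.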
Here is some sample data and the desired output:
np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.nan], (3,9)),
                  columns=[f'col{i}' for i in range(9)])
df
       col0    col1      col2      col3    col4    col5      col6  col7  col8
0       2.0     NaN  111111.0       1.0     2.0  5000.0  111111.0   2.0   NaN
1       1.0     NaN       2.0       3.3     1.0     2.0       1.0   3.3   1.0
2  111111.0  5000.0       1.0  111111.0  5000.0     1.0    5000.0   3.3   2.0
The result should be:
col0 is Int32
col1 is Int16
col2 is Int32
col3 is float64
col4 is Int16
col5 is Int16
col6 is Int32
col7 is float64
col8 is Int8
Answer 0 (score: 2)
Find which columns need to be cast to each type, then do the conversion for all of them at once, one type at a time.
import pandas as pd
import numpy as np

np.random.seed(10)
df = pd.DataFrame(np.random.choice([1, 2, 3.3, 5000, 111111, np.nan], (3,9)),
                  columns=[f'col{i}' for i in range(9)])

# Pick the narrowest nullable integer type from each column's maximum value
s = pd.cut(df.max(), bins=[0, 127, 32767, 2147483647], labels=['Int8', 'Int16', 'Int32'])
s = s.where((df.dtypes=='float') & (df.isnull() | (df%1 == 0)).all())
#            Cast previously     #  If all values are
#             float columns      #   "I"nteger-like

for idx, gp in s.groupby(s, observed=True):
    # Whole-column assignment replaces the dtype; in recent pandas,
    # df.loc[:, cols] = ... writes in place and would keep float64.
    df[gp.index] = df[gp.index].astype(idx)
df.dtypes
#col0      Int32
#col1      Int16
#col2      Int32
#col3    float64
#col4      Int16
#col5      Int16
#col6      Int32
#col7    float64
#col8       Int8
#dtype: object

print(df)
#     col0  col1    col2      col3  col4  col5    col6  col7  col8
#0       2   NaN  111111       1.0     2  5000  111111   2.0   NaN
#1       1   NaN       2       3.3     1     2       1   3.3     1
#2  111111  5000       1  111111.0  5000     1    5000   3.3     2
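For reuse across several datasets, the same idea can be wrapped in a function. The sketch below is one possible packaging, not the answer's own code: the name `convert_integer_like` is hypothetical, it sizes each column by its largest absolute value (so negative codes are handled, which is an assumption beyond the original bins), and it assumes no magnitudes exceed the Int32 range. The integer-likeness test is vectorized; only the final cast loops, and that loop is over columns rather than rows.

```python
import numpy as np
import pandas as pd


def convert_integer_like(df):
    """Return a copy of df with integer-like float64 columns cast to Int8/Int16/Int32."""
    out = df.copy()
    floats = out.select_dtypes(include="float64")
    # A column is integer-like if every value is NaN or has no fractional part.
    int_like = (floats.isna() | (floats % 1 == 0)).all()
    # The largest magnitude per column decides the narrowest safe width.
    peak = floats.abs().max()
    for col in int_like.index[int_like]:
        if pd.isna(peak[col]):      # all-NaN column: leave it as float64
            continue
        if peak[col] <= 127:
            dtype = "Int8"
        elif peak[col] <= 32767:
            dtype = "Int16"
        else:                       # assumption: values fit in Int32
            dtype = "Int32"
        out[col] = out[col].astype(dtype)
    return out


df = pd.DataFrame({
    "small": [1.0, np.nan, 2.0],
    "medium": [np.nan, 5000.0, 1.0],
    "large": [111111.0, 2.0, 1.0],
    "real": [1.0, 3.3, 111111.0],
})
result = convert_integer_like(df)
print(result.dtypes)
```

With this input, `small`, `medium`, and `large` come back as `Int8`, `Int16`, and `Int32`, while `real` stays `float64` because it contains 3.3. The input frame itself is left unchanged.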