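Hey, I want to downcast the dtypes of a DataFrame as safely and as quickly as possible. The DataFrame can hold any combination of mixed dtypes in its columns; most of them are string columns containing np.nan or the string 'NaN'. Integer columns with NaNs are converted to the nullable integer dtypes of pandas 0.24.2 ('Int8', 'Int16', ...). Empty lists and empty dicts seem to make the conversion and downcast fail because they get converted to floats (why??), so I exclude them. My approach works, but I can't believe there is no simpler way to do this, and above all no faster one. I have thought about this for a long time, and I need a fast method because the DataFrames I process can be around 1,000,000 x 150 cells.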
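For context, here is a minimal sketch of why the nullable dtypes matter for this problem. The series `s` is made up for illustration and is not part of my data; it only shows that a plain integer column with a missing value falls back to float64, while 'Int8' keeps the integers and the missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical example column: integers with one missing value.
s = pd.Series([1, np.nan, 3])
print(s.dtype)  # float64, because NaN forces a float column

# The nullable extension dtype stores the integers plus a missing-value mask.
nullable = s.astype("Int8")
print(nullable.dtype)            # Int8
print(nullable.isna().tolist())  # [False, True, False]
```

The nullable column is also smaller than the float64 one, which is the reason I bother with it at all.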
My approach:
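# Written against pandas 0.24.2.
import numpy as np
import pandas as pd

def convertAndDowncast(column, downcast=True):
    # Parse the column as numeric; non-numeric columns come back unchanged.
    try:
        column = pd.to_numeric(column, downcast='float', errors='ignore')
        if downcast:
            column = pd.to_numeric(column, downcast='integer', errors='ignore')
            # Move to the nullable integer dtypes so NaNs can be restored later.
            if column.dtype == 'int8':
                column = column.astype('Int8', casting='safe')
            elif column.dtype == 'int16':
                column = column.astype('Int16', casting='safe')
            elif column.dtype == 'int32':
                column = column.astype('Int32', casting='safe')
            elif column.dtype == 'int64':
                # Values that did not fit into int32 must stay 64-bit.
                column = column.astype('Int64', casting='safe')
    except Exception as e:
        print(e)
    return column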
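def dtypeCorrection(df, downcast=True):
    if isinstance(df, pd.DataFrame):
        # Remember where the real NaNs are so they can be restored afterwards.
        maskOfNans = df.isnull().values
        array = df.values
        # Columns that contain lists or dicts are excluded from the conversion.
        excludedColumns = set(df.columns[(df.applymap(type) == list).any(0)]) | set(df.columns[(df.applymap(type) == dict).any(0)])
        # The strings 'nan', 'NaN', 'NaT' and 'None' count as missing values too.
        maskOfStringNans = (array == 'nan') | (array == 'NaN') | (array == 'NaT') | (array == 'None')
        combinedMasks = maskOfNans | maskOfStringNans
        # Temporarily put 0 in place of every missing value so the columns parse as numbers.
        array[combinedMasks] = 0
        df[df.columns] = array
        for column in set(df.columns) - excludedColumns:
            df[column] = convertAndDowncast(df[column], downcast=downcast)
        # Restore the missing values.
        df = df.mask(combinedMasks, np.nan)
    return df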
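The building block both functions rely on is the `downcast` argument of `pd.to_numeric`. A small sketch of what it does on its own, using two made-up string series (`intsAsText` and `floatsAsText` are hypothetical names, not columns from my data):

```python
import pandas as pd

# Hypothetical string columns that actually hold numbers.
intsAsText = pd.Series(["0", "1", "2"])
floatsAsText = pd.Series(["0.5", "1.25"])

# to_numeric parses the strings, then downcast picks the smallest dtype that fits.
smallInt = pd.to_numeric(intsAsText, downcast="integer")
smallFloat = pd.to_numeric(floatsAsText, downcast="float")

print(smallInt.dtype)    # int8
print(smallFloat.dtype)  # float32
```

As far as I can tell, float32 is the smallest float dtype `downcast='float'` will pick, which matches the float32 columns in my output.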
Test:
df = pd.DataFrame.from_dict({0: {'integerColumn': 1, 'strColumn': 'test0', 'floatColumn': 0.1, 'strIntegerColumn': '0', 'strFloatColumn': '0.1',
                                 'strObjectColumn': '[1,2,3]', 'objectColumn': [1, 2, 3], 'strIntegerColumn2': '1', 'strFloatColumn2': '0.2',
                                 'testColumn': {}, 'testColumn2': [], 'testColumn3': [1, 2, 3]},
                             1: {'integerColumn': np.nan, 'strColumn': 'test1', 'floatColumn': np.nan, 'strIntegerColumn': 'NaN', 'strFloatColumn': 'nan',
                                 'strObjectColumn': 'NaN', 'objectColumn': np.nan, 'strIntegerColumn2': np.nan, 'strFloatColumn2': np.nan,
                                 'testColumn': {}, 'testColumn2': [], 'testColumn3': [1, 2, 3]}}, orient='index')
dtypeCorrection(df,downcast=True)
Output of dtypes:
testColumn3          object
integerColumn        Int8
strObjectColumn      object
strFloatColumn2      float32
strIntegerColumn     Int8
strIntegerColumn2    Int8
testColumn2          object
testColumn           object
strFloatColumn       float32
objectColumn         object
floatColumn          float32
strColumn            object
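To give an idea of what is at stake at my data sizes, here is a rough sketch that measures one column before and after downcasting. The column `values` is synthetic (one million small integers) and only stands in for a real column; the numbers for real data will depend on its contents.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one real column: one million small integers as int64.
values = pd.Series(np.arange(1_000_000, dtype="int64") % 100)
downcasted = pd.to_numeric(values, downcast="integer")

# Compare the memory footprint of the data itself, ignoring the index.
before = values.memory_usage(index=False, deep=True)
after = downcasted.memory_usage(index=False, deep=True)
print(downcasted.dtype, before, after)
```

Here the int8 column needs one eighth of the memory of the int64 one, so across 150 columns the saving adds up quickly.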