Python: Downcast a DataFrame as quickly as possible

Time: 2019-07-11 13:52:33

Tags: python pandas numpy types

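Hi, I want to downcast the dtypes of a DataFrame as safely and as quickly as possible. The DataFrame can have any combination of mixed dtypes in its columns; most are string columns containing np.nan or the string 'NaN'. Integer columns with NaNs are converted to the nullable integer dtypes of pandas 0.24.2 ('Int8', 'Int16', ...). Empty lists and empty dicts seem to make the conversion and the downcast fail because they get converted to floats (why?), so I exclude them. My approach works, but I can't believe there is no simpler way to do this, and especially no faster one. I have thought about this for a long time, and I need a fast method because the DataFrames I work with can be 1000000x150 cells.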
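For readers unfamiliar with the two building blocks mentioned above, here is a minimal sketch. The variable names are illustrative only and do not come from the question. It shows that the nullable 'Int8' dtype can hold a missing value and that pd.to_numeric can shrink a float column.

```python
import numpy as np
import pandas as pd

# A float column that holds only whole numbers plus one missing value.
values = pd.Series([1.0, 2.0, np.nan])

# The nullable extension dtype 'Int8' (capital I) keeps the missing value,
# which the plain NumPy dtype 'int8' cannot represent.
nullable = values.astype('Int8')

# pd.to_numeric with downcast='float' picks the smallest float dtype that fits.
small_float = pd.to_numeric(pd.Series([0.1, 0.2]), downcast='float')

print(nullable.dtype, small_float.dtype)
```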

My approach:

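import numpy as np
import pandas as pd


def convertAndDowncast(column, downcast=True):
    try:
        # Parse to numeric; with errors='ignore' an unparsable column is returned unchanged.
        column = pd.to_numeric(column, downcast='float', errors='ignore')
        if downcast:
            column = pd.to_numeric(column, downcast='integer', errors='ignore')
            # Switch to the nullable integer dtypes so that NaNs can be restored later.
            if column.dtype == 'int8':
                column = column.astype('Int8', casting='safe')
            elif column.dtype == 'int16':
                column = column.astype('Int16', casting='safe')
            # Note: int64 columns are also mapped to 'Int32' here.
            elif column.dtype == 'int32' or column.dtype == 'int64':
                column = column.astype('Int32', casting='safe')
    except Exception as e:
        # On any failure, report it and fall through with the column as it currently is.
        print(e)
    return column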


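def dtypeCorrection(df, downcast=True):
    if isinstance(df, pd.DataFrame):
        # Real missing values and their string spellings are masked together.
        maskOfNans = df.isnull().values
        array = df.values
        # Columns holding lists or dicts are skipped because they break the conversion.
        excludedColumns = set(df.columns[(df.applymap(type) == list).any(0)]) | set(df.columns[(df.applymap(type) == dict).any(0)])
        maskOfStringNans = (array == 'nan') | (array == 'NaN') | (array == 'NaT') | (array == 'None')
        combinedMasks = maskOfNans | maskOfStringNans
        # Temporarily replace missing entries with 0 so integer columns can be parsed.
        array[combinedMasks] = 0
        df[df.columns] = array
        # Iterate over a list of names: newer pandas versions reject a set as an indexer.
        for column in list(set(df) - excludedColumns):
            df[column] = convertAndDowncast(df[column], downcast=downcast)
        # Put the missing values back where the masks marked them.
        df = df.mask(combinedMasks, np.nan)
    return df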

Test:

df = pd.DataFrame.from_dict({0:{'integerColumn':1,'strColumn':'test0','floatColumn':0.1,'strIntegerColumn':'0','strFloatColumn':'0.1',
                                'strObjectColumn':'[1,2,3]','objectColumn':[1,2,3],'strIntegerColumn2':'1','strFloatColumn2':'0.2',
                                'testColumn':{},'testColumn2':[],'testColumn3':[1,2,3]},
                            1:{'integerColumn':np.nan,'strColumn':'test1','floatColumn':np.nan,'strIntegerColumn':'NaN','strFloatColumn':'nan',
                               'strObjectColumn':'NaN','objectColumn':np.nan,'strIntegerColumn2':np.nan,'strFloatColumn2':np.nan,
                               'testColumn':{},'testColumn2':[],'testColumn3':[1,2,3]}},orient='index')

dtypeCorrection(df,downcast=True)

Output of dtypes:

testColumn3           object
integerColumn           Int8
strObjectColumn       object
strFloatColumn2      float32
strIntegerColumn        Int8
strIntegerColumn2       Int8
testColumn2           object
testColumn            object
strFloatColumn       float32
objectColumn          object
floatColumn          float32
strColumn             object
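For comparison, here is a hedged sketch of a column-wise variant that avoids copying the whole frame into one object array. It is not the approach from the question. The helper name downcast_series and the constant STRING_NANS are hypothetical, and the sketch assumes that columns containing lists or dicts have already been excluded. Whether it is actually faster on a 1000000x150 frame would need to be measured.

```python
import numpy as np
import pandas as pd

# Same string placeholders that the masks in the question look for.
STRING_NANS = {'nan', 'NaN', 'NaT', 'None'}


def downcast_series(column):
    """Hypothetical helper; assumes the column holds only scalars (no lists or dicts)."""
    # Treat the textual placeholders as missing values.
    is_text_nan = column.map(lambda v: isinstance(v, str) and v in STRING_NANS)
    cleaned = column.mask(is_text_nan.astype(bool), np.nan)
    numeric = pd.to_numeric(cleaned, errors='coerce')
    # If coercion created new missing values, the column holds real text: keep it.
    if numeric.isna().sum() > cleaned.isna().sum():
        return column
    present = numeric.dropna()
    if len(present) and (present % 1 == 0).all():
        # Whole numbers: choose the smallest nullable integer dtype that fits.
        for name in ('Int8', 'Int16', 'Int32', 'Int64'):
            info = np.iinfo(np.dtype(name.lower()))
            if present.min() >= info.min and present.max() <= info.max:
                return numeric.astype(name)
    # Everything else is treated as float and shrunk where possible.
    return pd.to_numeric(numeric, downcast='float')


ints = downcast_series(pd.Series(['0', 'NaN'], dtype=object))
floats = downcast_series(pd.Series(['0.1', 'nan'], dtype=object))
text = downcast_series(pd.Series(['test0', 'test1'], dtype=object))
print(ints.dtype, floats.dtype, text.dtype)
```

The design choice in this sketch is to use errors='coerce' and compare the number of missing values before and after parsing, rather than substituting 0 for missing entries and restoring them afterwards.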

0 answers:

No answers yet