Casting an H2OFrame from pandas

Date: 2018-04-13 18:30:09

Tags: python pandas casting h2o

I am using h2o from Python for predictive modeling. I loaded some data from a csv file with pandas, specifying some of the column types:

import os
import pandas as pd

# Columns to read as strings (pandas dtype 'object')
dtype_dict = {'SIT_SSICCOMP':'object',
              'SIT_CAPACC':'object',
              'PTT_SSIRMPOL':'object',
              'PTT_SPTCLVEI':'object',
              'cap_pad':'object',
              'SIT_SADNS_RESP_PERC':'object',
              'SIT_GEOCODE':'object',
              'SIT_TIPOFIRMA':'object',
              'SIT_TPFRODESI':'object',
              'SIT_CITTAACC':'object',
              'SIT_INDIRACC':'object',
              'SIT_NUMCIVACC':'object'
              }
date_cols = ["SIT_SSIDTSIN","SIT_SSIDTDEN","PTT_SPTDTEFF","PTT_SPTDTSCA","SIT_DTANTIFRODE","PTT_DTELABOR"]


columns_to_drop = ['SIT_TPFRODESI','SIT_CITTAACC',
       'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
       'SIT_LATITACC','cap_pad','SIT_DTANTIFRODE']


comp='mycomp'

file_completo = os.path.join(dataDir,"db4modelrisk_"+comp+".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo, sep=";", encoding='latin1',
                         header=0, infer_datetime_format=True, na_values=[''], keep_default_na=False,
                         parse_dates=date_cols, dtype=dtype_dict, nrows=500000)
db4scoring.drop(labels=columns_to_drop, axis=1, inplace=True)

Then, after setting up an h2o cluster, I import the data into h2o with db4scoring_h2o = H2OFrame(db4scoring) and convert the categorical predictors to factors:

db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()

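When I check the data types with db4scoring.dtypes, they are set correctly, but after importing the data into h2o I notice that the H2OFrame has performed some unwanted conversions to enum (for example from float or int). Is there a way to specify the column types when creating an H2OFrame?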

1 answer:

Answer 0: (score: 1)

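Yes, there is. See the H2OFrame documentation here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe

You just need to pass the column_types parameter when casting the pandas DataFrame to an H2OFrame.

Here is a short example: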

# imports
import h2o
import numpy as np
import pandas as pd

# create small random pandas df
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)),
                  columns=list('AB'))
print(df)

#   A  B
#0  5  0
#1  1  3
#2  4  8
#3  3  9
# ...

# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types 

#       A   B
# type int enum
# ...
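
For a wide frame like the one in the question, writing the column_types dict by hand gets tedious. The following is a minimal sketch of one way to derive it from the pandas dtypes. The helper name build_column_types, the dtype-to-type mapping, the example dtypes and the column hypothetical_amount are all assumptions made for illustration; they are not part of h2o or of the original data. Only the type strings "numeric", "enum", "string" and "time" are used.

```python
def build_column_types(dtype_names, categorical=()):
    """Map pandas dtype names to H2O column type strings.

    dtype_names: dict of column name -> pandas dtype name (e.g. "int64").
    categorical: column names to force to "enum" regardless of dtype.
    """
    categorical = set(categorical)
    column_types = {}
    for name, dtype_name in dtype_names.items():
        if name in categorical:
            column_types[name] = "enum"
        elif dtype_name.startswith("datetime64"):
            column_types[name] = "time"
        elif dtype_name.startswith(("int", "uint", "float")):
            column_types[name] = "numeric"
        else:
            # object and anything unrecognised is kept as text
            column_types[name] = "string"
    return column_types


# Dtype names as produced by: db4scoring.dtypes.astype(str).to_dict()
# (the dtypes listed here are assumed for illustration)
example_dtypes = {
    "SIT_SSIDTSIN": "datetime64[ns]",
    "SIT_GEOCODE": "object",
    "SIT_SADTPROV": "int64",
    "PTT_SPTFRAZ": "float64",
    "hypothetical_amount": "float64",
}
column_types = build_column_types(
    example_dtypes, categorical=["SIT_SADTPROV", "PTT_SPTFRAZ"]
)
print(column_types)

# With a running cluster, the dict would then be passed on import:
# db4scoring_h2o = h2o.H2OFrame(db4scoring, column_types=column_types)
```

Listing the categorical columns explicitly keeps the choice of factors in one place, so the separate asfactor() calls from the question would no longer be needed for those columns.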