Using one-hot encoding on training data gives more coefficients than there are actual columns

Time: 2019-06-05 10:45:22

Tags: machine-learning pyspark

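When I use one-hot encoding, the result has more entries than the actual number of columns.

I have added the code I am working on below; please take a look.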

```
coefficient : [0.0002054800236568163,2.0439310112800845e-06,0.0012587034306473716,0.0003538955306262437,0.0014205218783369504,-0.09556139895866411,-0.01119907246997649,0.0009278595718565514,0.055504033414581995,-0.0060363295643237206,0.1208861923722965,-0.03708163001735046,-0.011924436110750052,0.18739103759110842,-0.06788345901273717,0.24122048812836505,-0.08719840615913002,-0.18789455768956798,0.2881887187896297,-0.13987095144035597,-0.016854358762055686,0.029427863518793968,-0.01918399191298753,0.011116841193397481,0.04191756597743858,-0.04191756597744139,-0.003281743064241399,0.0032817430642382403,-0.007199912662577535,0.007199912662575341,0.011613111115769799,-0.042503873680151225,0.10019922603083396,-0.34485589766428043,0.3756841570542743,-0.019416573355186505,0.37012264711363996]
features column : ['balance', 'day', 'duration', 'campaign', 'age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome']
```

```python
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assumed to be defined earlier (not shown): dataset, stringFeatures,
# numericalFeatures, label, userId, feature_colm and self.trainDataRatio.
indexed_features = []
oneHotEncodedFeatures = []
for colm in stringFeatures:
    # Map each string category to a numeric index.
    indexer = StringIndexer(inputCol=colm, outputCol='indexed_' + colm).fit(dataset)
    indexed_features.append('indexed_' + colm)
    dataset = indexer.transform(dataset)
    # One-hot encode the index; each column becomes a vector, not a single value.
    encoder = OneHotEncoderEstimator(inputCols=['indexed_' + colm], outputCols=['encoded_' + colm],
                                     dropLast=True, handleInvalid='keep').fit(dataset)
    oneHotEncodedFeatures.append('encoded_' + colm)
    dataset = encoder.transform(dataset)
    dataset.show()

# Assemble the numeric columns and the encoded vectors into one feature vector.
final_features = numericalFeatures + oneHotEncodedFeatures
featureassembler = VectorAssembler(inputCols=final_features,
                                   outputCol="features")
dataset = featureassembler.transform(dataset)
# vectorIndexer = VectorIndexer(inputCol='features', outputCol='vectorIndexedFeatures', maxCategories=4).fit(
#     dataset)
# dataset = vectorIndexer.transform(dataset)
trainDataRatioTransformed = self.trainDataRatio
testDataRatio = 1 - trainDataRatioTransformed
trainingData, testData = dataset.randomSplit([trainDataRatioTransformed, testDataRatio], seed=40)
# applying the model
lr = LinearRegression(featuresCol="features", labelCol=label)
regressor = lr.fit(trainingData)
# saving the model
locationAddress = 'hdfs://10.171.0.181:9000/dev/dmxdeepinsight/datasets/'
modelPersist = 'linearRegressorModel.parquet'
modelStorageLocation = locationAddress + userId + modelPersist
regressor.write().overwrite().save(modelStorageLocation)
# print regressor.featureImportances
# print(dataset.orderBy(feature_colm, ascending=True))
# pred = regressor.transform(testData)
# coefficient & intercept
print("coefficient : " + str(regressor.coefficients))
coefficient_t = str(regressor.coefficients)
# print("intercept : " + str(regressor.intercept))
intercept_t = str(regressor.intercept)
print('features column :', feature_colm)
```
There should be only thirteen coefficients, but it is showing twenty-nine.
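
One possible explanation, sketched under assumptions: `VectorAssembler` flattens each one-hot encoded column into one slot per category, so the model learns one coefficient per slot rather than one per original column. The snippet below only does that arithmetic in plain Python (no Spark needed). The helper names `encoded_width` and `expected_coefficient_count` are made up for illustration, the split into five numeric and eight string columns is inferred from the feature list, and the category counts are hypothetical because the question does not state them.

```python
def encoded_width(num_categories, drop_last=True, handle_invalid="keep"):
    """Length of the one-hot vector produced for a single indexed column."""
    width = num_categories
    if handle_invalid == "keep":
        width += 1  # extra slot reserved for unseen/invalid values
    if drop_last:
        width -= 1  # the last slot is dropped
    return width


def expected_coefficient_count(numerical, category_counts, **encoder_args):
    """Numeric columns contribute one slot each; encoded columns one per kept category."""
    encoded = sum(encoded_width(n, **encoder_args) for n in category_counts.values())
    return len(numerical) + encoded


# Assumption: these five columns are the numeric ones.
numerical = ["balance", "day", "duration", "campaign", "age"]

# Assumption: hypothetical number of distinct values per string column.
category_counts = {
    "job": 12, "marital": 3, "education": 4, "default": 2,
    "housing": 2, "loan": 2, "contact": 3, "poutcome": 4,
}

total = expected_coefficient_count(numerical, category_counts)
print(len(numerical) + len(category_counts), "input columns ->", total, "coefficients")
```

With these assumed counts the total is 37, which is also the number of values in the printed coefficient vector. If that reading is right, the thirteen names in the features list are the input columns, not the slots of the assembled `features` vector.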

0 answers:

No answers