ValueError: No axis named Column<rand() AS `CrossValidator`> for object type <class 'pandas.core.frame.DataFrame'>

Time: 2017-10-10 12:44:34

Tags: pyspark

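I am trying to run a random forest classifier (pyspark.ml.classification.RandomForestClassifier) with cross-validation using pyspark.ml.

I have read the input file, which is in CSV format, into a DataFrame. When I run the code below, I get the ValueError shown in the traceback further down.

Here is the Python code.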

import pyspark
import pandas as pd
import numpy as np
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics

sc = pyspark.SparkContext()
sql = SQLContext(sc)

trainingData= pd.read_csv("CSVfilepath", index_col=0, parse_dates=True)

print trainingData
numFolds = 10 


rf = RandomForestClassifier(numTrees=100, maxDepth=5, maxBins=5, labelCol="label", featuresCol="features", seed=42)
evaluator = MulticlassClassificationEvaluator().setLabelCol("V5409").setPredictionCol("prediction").setMetricName("accuracy") 

paramGrid = ParamGridBuilder().build()

pipeline = Pipeline(stages=[rf])
paramGrid=ParamGridBuilder().build()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)

model = crossval.fit(trainingData)

The error I get is:

Traceback (most recent call last):
  File "randomforest_cv.py", line 46, in <module>
    model = crossval.fit(trainingData)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pyspark/ml/base.py", line 64, in fit
    return self._fit(dataset)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pyspark/ml/tuning.py", line 224, in _fit
    df = dataset.select("*", rand(seed).alias(randCol))
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 2085, in select
    axis = self._get_axis_number(axis)
  File "/home/hadoopuser/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 353, in _get_axis_number
    .format(axis, type(self)))
ValueError: No axis named Column<rand(-4372709618522015412) AS `CrossValidator_42cab674dd6c1d100ef0_rand`> for object type <class 'pandas.core.frame.DataFrame'>

Can someone help me understand what the problem is and how to solve it? My guess is that the problem is the DataFrame format.
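
The traceback is consistent with that guess: the call to `dataset.select(...)` inside `CrossValidator._fit` resolves to pandas' `generic.py`, which suggests that `crossval.fit` received a pandas DataFrame (the result of `pd.read_csv`) rather than a Spark DataFrame. Below is a minimal sketch of one possible conversion. The helper names `feature_columns` and `pandas_to_spark_features` are hypothetical, and the sketch assumes that every non-label column is numeric, which nothing in the question confirms.

```python
def feature_columns(columns, label_col):
    """Return all column names except the label, preserving order."""
    return [c for c in columns if c != label_col]


def pandas_to_spark_features(sql_context, pandas_df, label_col):
    """Convert a pandas DataFrame to a Spark DataFrame with a 'features' column.

    Assumptions (not stated in the question): every non-label column is
    numeric, and `label_col` holds the class label.
    """
    # Imported here so feature_columns() stays usable without Spark installed.
    from pyspark.ml.feature import VectorAssembler

    # createDataFrame accepts a pandas DataFrame; the pandas index is not
    # carried over, so call reset_index() first if it is needed as a column.
    spark_df = sql_context.createDataFrame(pandas_df)

    # RandomForestClassifier reads a single vector column, so pack the
    # individual feature columns into one.
    assembler = VectorAssembler(
        inputCols=feature_columns(pandas_df.columns, label_col),
        outputCol="features",
    )
    return assembler.transform(spark_df)
```

With the names from the question, this might be used as `sparkData = pandas_to_spark_features(sql, trainingData, "V5409")` followed by `crossval.fit(sparkData)`. Treating `V5409` as the label is an inference from the evaluator's `setLabelCol("V5409")`; if that is right, the classifier's `labelCol="label"` would presumably need to name the same column.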

0 answers:

No answers