I am trying to build a random forest classifier for DataFrames using the pyspark.ml library (rather than mllib, which works on RDDs). Do I have to use the Pipeline given in the documentation? I just want to build a simple model:
rf = RandomForestClassifier(labelCol = labs, featuresCol = rawdata)
I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/__init__.py", line 104, in wrapper
    return func(self, **kwargs)
  File "/usr/lib/spark/python/pyspark/ml/classification.py", line 910, in __init__
    self.setParams(**kwargs)
  File "/usr/lib/spark/python/pyspark/__init__.py", line 104, in wrapper
    return func(self, **kwargs)
  File "/usr/lib/spark/python/pyspark/ml/classification.py", line 928, in setParams
    return self._set(**kwargs)
  File "/usr/lib/spark/python/pyspark/ml/param/__init__.py", line 421, in _set
    raise TypeError('Invalid param value given for param "%s". %s' % (p.name, e))
TypeError: Invalid param value given for param "labelCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type
A sample of my labels:
+---+
| _2|
+---+
|0.0|
|1.0|
|0.0|
|0.0|
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|1.0|
|1.0|
+---+
My data looks similar, with about 180 columns.
Answer (score: 1)
Spark DataFrames are not used like that in Spark ML: all of your features need to be assembled as a vector in a single column, usually (but not necessarily) named features. Additionally, both labelCol and featuresCol expect column names (strings), not DataFrames, which is what the TypeError is complaining about: labelCol="labs" would mean your labels must live in a column named labs, not in _2.
Here is an example with toy data to demonstrate the idea:
spark.version
# u'2.2.0'
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors
df = sqlContext.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.0)),
    (1.0, Vectors.dense(1.0, 0.0))],
    ["label", "features"])
df.show() # notice there are only 2 columns, and 'features' is a 2-d vector
# +-----+---------+
# |label| features|
# +-----+---------+
# | 0.0|[0.0,1.0]|
# | 1.0|[1.0,0.0]|
# +-----+---------+
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)
This answer of mine may help you convert your data into the required format.