Random forest with pyspark.ml for DataFrames

Time: 2017-10-18 15:42:40

Tags: machine-learning pyspark random-forest apache-spark-ml

I am trying to build a random forest classifier for DataFrames using the pyspark.ml library (rather than mllib, which works on RDDs). Do I have to use the Pipeline given in the documentation? I just want to build a simple model,

rf = RandomForestClassifier(labelCol = labs, featuresCol = rawdata) 

but I run into the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/__init__.py", line 104, in wrapper
    return func(self, **kwargs)
  File "/usr/lib/spark/python/pyspark/ml/classification.py", line 910, in __init__
    self.setParams(**kwargs)
  File "/usr/lib/spark/python/pyspark/__init__.py", line 104, in wrapper
    return func(self, **kwargs)
  File "/usr/lib/spark/python/pyspark/ml/classification.py", line 928, in setParams
    return self._set(**kwargs)
  File "/usr/lib/spark/python/pyspark/ml/param/__init__.py", line 421, in _set
    raise TypeError('Invalid param value given for param "%s". %s' % (p.name, e))
TypeError: Invalid param value given for param "labelCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type

A sample of my labels:

+---+
| _2|
+---+
|0.0|
|1.0|
|0.0|
|0.0|
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|0.0|
|1.0|
|1.0|
+---+

My data looks similar, with 180 columns.

1 Answer:

Answer 0 (score: 1)

Spark DataFrames are not used like that in Spark ML; all your features need to be assembled into a vector in a single column, usually (but not necessarily) named features. Additionally, labelCol expects the name of the label column as a string, not a DataFrame: labelCol="labs" means your labels must sit in a column named labs, not _2.

Here is an example with toy data to give you the idea:

spark.version
# u'2.2.0'

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))], 
     ["label", "features"])

df.show() # notice there are only 2 columns, and 'features' is a 2-d vector
# +-----+---------+ 
# |label| features|
# +-----+---------+ 
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)
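
Once fitted, the model scores a DataFrame directly with transform, again with no Pipeline required; a minimal check on the same toy data:

rf_model.transform(df).select("label", "prediction").show()
# on this trivially separable toy set the predictions should match the labels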

This answer of mine may help you convert your data into the required format.
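
For instance, here is a minimal sketch of that conversion with VectorAssembler, assuming your _2 label column and your ~180 feature columns live in a single DataFrame (full_df is a placeholder name for it):

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# assumption: full_df holds the label in column _2 and the raw features in all other columns
feature_cols = [c for c in full_df.columns if c != "_2"]

# pack the feature columns into a single vector column named 'features'
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = (assembler.transform(full_df)
                     .withColumnRenamed("_2", "label")
                     .select("label", "features"))

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rf_model = rf.fit(train_df)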