PySpark - Classification Implementation

Posted: 2017-07-18 08:55:27

Tags: apache-spark pyspark apache-spark-mllib

I have a use case where I need to predict a multi-class label value. I have some basic doubts about the data-preparation step in my PySpark implementation.

Suppose I have the following dataset:

A      B       C      Label
10     class1  Boy    Cricket
12     class3  Boy    Football
11.6   class2  Girl   Hockey
...
12.2   class1  Girl   Hockey

In this dataset, everything is categorical except feature A.

Assume we are using a Decision Tree classifier for the multi-class prediction.

I have done the following data-preparation steps:

Step 1: Min-Max normalizer for feature A

Step 2: String indexer for features A, B, and C

Step 3: One-hot encoding for features A, B, and C
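For reference, here is a minimal plain-Python sketch of what Steps 1 and 2 compute. This is an illustration of the underlying math only, not the Spark API; in Spark these would be the MinMaxScaler and StringIndexer transformers, and the helper names below are made up:

```python
from collections import Counter

def min_max_scale(values, lo=0.0, hi=1.0):
    # Rescale numeric values into [lo, hi], as MinMaxScaler does.
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / span for v in values]

def string_index(values):
    # Map each distinct string to a double index, most frequent first
    # (StringIndexer's default frequency-descending order; ties are
    # broken alphabetically here for determinism).
    freq = Counter(values)
    order = sorted(freq, key=lambda k: (-freq[k], k))
    mapping = {k: float(i) for i, k in enumerate(order)}
    return [mapping[v] for v in values], mapping

a = [10.0, 12.0, 11.6, 12.2]
print(min_max_scale(a))                 # smallest value -> 0.0, largest -> 1.0

labels = ["Cricket", "Football", "Hockey", "Hockey"]
indexed, mapping = string_index(labels)
print(indexed)                          # most frequent value ("Hockey") -> 0.0
```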

Now the transformed dataframe looks like:

A     B       B_Indexed_transformed   C     C_Indexed_transformed   Label
0.86  Class1  [5,1,[1.0,1.0]]         Boy   [2,1,[1.0,1.0]]         Cricket
...

Next, I keep the A, B_Indexed_transformed, C_Indexed_transformed, and Label columns and drop all the other columns.

Step 4: Create LabeledPoint data from [label, features] pairs
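Step 4 amounts to mapping each transformed row to a (label, features) pair. A plain-Python sketch of that mapping follows; the column layout is taken from the example frame above, the helper name row_to_pair and the label index 1.0 are hypothetical, and in Spark this would be a df.rdd.map(...) producing LabeledPoint objects:

```python
def row_to_pair(a_scaled, b_vec, c_vec, label_idx):
    # Concatenate the scaled numeric feature with the expanded
    # one-hot vectors, and pair the result with the numeric label.
    features = [a_scaled] + list(b_vec) + list(c_vec)
    return float(label_idx), features

# First example row: A already min-max scaled, B and C one-hot
# encoded, label "Cricket" already string-indexed (index assumed).
label, features = row_to_pair(0.86, [0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 1.0], 1.0)
print(label, features)
```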

So my question is: in order to pass this data to the Decision Tree algorithm (or any other classifier), do I need to apply any transformation to the Label column?

I have done String Indexing on the Label column. Is that the correct approach?
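Both tracebacks below come from the same place: LabeledPoint's constructor does self.label = float(label). A raw string label raises ValueError, and a non-numeric value such as None (e.g. when the wrong column or a missing value reaches the lambda) raises TypeError; only a plain numeric label, such as the StringIndexer output, succeeds. A minimal reproduction of the two failure modes:

```python
def coerce_label(label):
    # What LabeledPoint.__init__ effectively does with the label.
    return float(label)

print(coerce_label(2.0))    # an indexed numeric label works

try:
    coerce_label("businessevaluator")   # raw string label
except ValueError as exc:
    print("ValueError:", exc)

try:
    coerce_label(None)      # e.g. a missing / non-numeric value
except TypeError as exc:
    print("TypeError:", exc)
```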

When I pass the string-indexed Label column into the LabeledPoint transformation, I get this error:

  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1293, in takeUpToNumLeft
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rddsampler.py", line 95, in func
  File "/hba01/yarn/nm/usercache/sbeathanabhotla/appcache/application_1498495374459_1420410/container_1498495374459_1420410_01_000001/build.zip/com/ci/roletagging/service/ModelBuilder.py", line 17, in <lambda>
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/mllib/regression.py", line 51, in __init__
    self.label = float(label)
TypeError: float() argument must be a string or a number

If I pass the Label data as raw strings without any transformation, I get this error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1293, in takeUpToNumLeft
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rddsampler.py", line 95, in func
  File "/hba06/yarn/nm/usercache/sbeathanabhotla/appcache/application_1498495374459_1425329/container_1498495374459_1425329_01_000001/build.zip/com/ci/roletagging/service/ModelBuilder.py", line 17, in <lambda>
  File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/mllib/regression.py", line 51, in __init__
    self.label = float(label)
ValueError: could not convert string to float: businessevaluator

0 Answers