I have a use case for predicting a multi-class label, and I have a basic doubt about the data-preparation step in the PySpark implementation.
Suppose I have the following dataset:
A B C Label
10 class1 Boy Cricket
12 class3 Boy Football
11.6 class2 Girl Hockey
..
..
..
..
12.2 class1 Girl Hockey
In this dataset everything is categorical except feature A.
Suppose we are doing multi-class prediction with a Decision Tree classifier.
I have done these data-preparation steps:
Step 1: Min-Max normalizer on feature A
Step 2: String indexer on features B and C
Step 3: One-hot encoding of the indexed features B and C
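As a sanity check, the three steps above can be sketched in plain Python (illustrative only, not the actual PySpark transformers); the column names match the example, everything else here is an assumption:

```python
# Plain-Python sketch of the three prep steps (illustrative, not PySpark).

rows = [
    {"A": 10.0, "B": "class1", "C": "Boy"},
    {"A": 12.0, "B": "class3", "C": "Boy"},
    {"A": 11.6, "B": "class2", "C": "Girl"},
]

# Step 1: min-max normalize feature A into [0, 1]
a_vals = [r["A"] for r in rows]
a_min, a_max = min(a_vals), max(a_vals)
for r in rows:
    r["A"] = (r["A"] - a_min) / (a_max - a_min)

# Step 2: string-index the categorical features B and C
def build_index(values):
    # map each distinct string to an integer index (first-seen order)
    return {v: i for i, v in enumerate(dict.fromkeys(values))}

b_index = build_index(r["B"] for r in rows)
c_index = build_index(r["C"] for r in rows)

# Step 3: one-hot encode the indexed values as dense 0/1 vectors
def one_hot(idx, size):
    vec = [0.0] * size
    vec[idx] = 1.0
    return vec

for r in rows:
    r["B_vec"] = one_hot(b_index[r["B"]], len(b_index))
    r["C_vec"] = one_hot(c_index[r["C"]], len(c_index))

print(rows[0])  # first row with scaled A and one-hot B/C
```

In Spark the same pipeline would use MinMaxScaler, StringIndexer, and OneHotEncoder, which emit sparse vectors rather than plain lists, but the mapping is the same idea.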
Now the transformed dataframe looks like:
A B B_Indexed_transformed C C_Indexed_transformed Label
0.86 Class1 [5,1,[1.0,1.0]] Boy [2,1,[1.0,1.0]] Cricket
.
.
.
Next, I will keep only the columns A, B_Indexed_transformed, C_Indexed_transformed, and Label, and drop all the others.
Step 4: Create LabeledPoint data from [label, features] pairs.
So my question is: in order to pass this data to the Decision Tree algorithm (or any other classifier), do I need to apply any transformation to the Label column?
I have already done String Indexing on the Label column. Is that the right approach?
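For context on why the label's type matters: LabeledPoint coerces its label with `float(label)` (visible in both tracebacks below), so the label must already be numeric, e.g. the double index that StringIndexer produces, never the raw class string. A minimal sketch of the failing and working cases, without PySpark (the label names are from the example above; the rest is illustrative):

```python
# LabeledPoint effectively does `self.label = float(label)` in its
# constructor, so the label must be a number. Illustrative, not PySpark.

labels = ["Cricket", "Football", "Hockey"]

# String-indexing the label column assigns each class a float index,
# which float() accepts without error.
label_index = {name: float(i) for i, name in enumerate(labels)}
print(label_index["Hockey"])  # a numeric label float() can handle

# Passing the raw class string instead reproduces the second traceback:
try:
    float("Cricket")
except ValueError as e:
    print(e)  # could not convert string to float: ...
```

The first traceback's TypeError ("argument must be a string or a number") suggests that something else entirely, such as a Row, list, or None, reached `float()` in that run; the ValueError is the raw-string case.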
When I pass the string-indexed Label column into the LabeledPoint transformation, I hit this error:
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1293, in takeUpToNumLeft
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rddsampler.py", line 95, in func
File "/hba01/yarn/nm/usercache/sbeathanabhotla/appcache/application_1498495374459_1420410/container_1498495374459_1420410_01_000001/build.zip/com/ci/roletagging/service/ModelBuilder.py", line 17, in <lambda>
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/mllib/regression.py", line 51, in __init__
self.label = float(label)
TypeError: float() argument must be a string or a number
And if I pass the Label data as a string without any transformation, I face this error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1293, in takeUpToNumLeft
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/rddsampler.py", line 95, in func
File "/hba06/yarn/nm/usercache/sbeathanabhotla/appcache/application_1498495374459_1425329/container_1498495374459_1425329_01_000001/build.zip/com/ci/roletagging/service/ModelBuilder.py", line 17, in <lambda>
File "/vol1/cloudera/parcels/CDH-5.8.3-1.cdh5.8.3.p0.2/lib/spark/python/lib/pyspark.zip/pyspark/mllib/regression.py", line 51, in __init__
self.label = float(label)
ValueError: could not convert string to float: businessevaluator