PySpark version 2.4.0
I am trying to reduce the number of outputs in the output layer. Unfortunately, I cannot get it to work with PySpark's MLPC (MultilayerPerceptronClassifier).
I am using the letters dataset. Link: https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/letterdata.csv
>>> df.show(1)
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
|letter|xbox|ybox|width|height|onpix|xbar|ybar|x2bar|y2bar|xybar|x2ybar|xy2bar|xedge|xedgey|yedge|yedgex|
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
|     T|   2|   8|    3|     5|    1|   8|  13|    0|    6|    6|    10|     8|    0|     8|    0|     8|
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
only showing top 1 row
I have already used StringIndexer to convert the 'letter' column into an integer column, 'letter_id'.
I have two options: either use letter_id as the label, which means an output layer with 26 outputs,
>>> indexed_df.columns[1:-1]
['xbox', 'ybox', 'width', 'height', 'onpix', 'xbar', 'ybar', 'x2bar', 'y2bar', 'xybar', 'x2ybar', 'xy2bar', 'xedge', 'xedgey', 'yedge', 'yedgex', 'letter_id']
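As a quick check that there really are 26 classes behind letter_id (one per letter), something like this should return 26:
>>> indexed_df.select('letter_id').distinct().count()  # expect 26 (A-Z)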
Or, create a binary-encoded column as the output, which would have only 5 outputs (see the small encoding sketch after the table below).
>>> indexed_df.select('letter', 'letter_id', 'binary').distinct().show()
+------+---------+------+
|letter|letter_id|binary|
+------+---------+------+
|     D|        1| 00001|
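For context, the 5-bit string comes from Python's binary formatting, and all 26 letter ids (0 to 25) fit in 5 bits because 2**5 = 32 >= 26. A standalone sketch, independent of the dataset:
>>> for letter_id in (0, 1, 25):
...     print('{0:05b}'.format(letter_id))
...
00000
00001
11001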
Here is the complete code:
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import col, udf

# Load the letters dataset.
df = spark.read.csv('letterdata.csv', header=True, inferSchema=True)

# Index the target: 'letter' (A-Z) -> integer 'letter_id'.
indexer = StringIndexer(inputCol='letter', outputCol='letter_id')
indexed_df = indexer.fit(df).transform(df)
indexed_df = indexed_df.withColumn('letter_id', indexed_df['letter_id'].cast('int'))
indexed_df.select('letter', 'letter_id').distinct().show()

# Alternative target: 5-bit binary string encoding of letter_id.
udf_binary = udf(lambda x: '{0:05b}'.format(x))
indexed_df = indexed_df.withColumn('binary', udf_binary(indexed_df['letter_id']))
indexed_df.select('letter', 'letter_id', 'binary').distinct().show()

# Option 1: the 16 feature columns plus 'letter_id' as the label.
final_cols = indexed_df.columns[1:-1]
# Option 2: the 16 feature columns plus 'binary' as the label.
#final_cols = indexed_df.columns[1:-2] + ['binary']
dataset = indexed_df.select(final_cols)

# Assemble every column except the label into one 'features' vector.
parser = VectorAssembler(inputCols=final_cols[:-1], outputCol="features")
dataset = parser.transform(dataset)
final = dataset.select(col('letter_id').alias('label'), col('features'))
#final = dataset.select(col('binary').alias('label'), col('features'))

# 60/40 train/test split.
splits = final.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# Topology: input layer, two hidden layers, 26-way output layer.
layers = [17, 30, 20, 26]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
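A sanity check one could run before fitting (a sketch; per the MLPC documentation, the first entry of layers is the input size and must match the length of the assembled features vector, and the last entry is the number of classes):
# Sketch: compare the assembled feature vector length with layers[0].
n_features = len(train.first()['features'])
print(n_features, layers[0])  # these two must be equal for MLPC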
When I try to fit the model,
model = trainer.fit(train)
it fails with the following error:
2018-12-24 11:17:27 WARN BlockManager:66 - Putting block rdd_44_0 failed due to exception java.lang.ArrayIndexOutOfBoundsException.
2018-12-24 11:17:27 WARN BlockManager:66 - Block rdd_44_0 could not be removed as it was not found on disk or in memory
2018-12-24 11:17:27 ERROR Executor:91 - Exception in task 0.0 in stage 26.0 (TID 336)
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:665)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:664)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:664)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:660)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:298)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-12-24 11:17:27 WARN TaskSetManager:66 - Lost task 0.0 in stage 26.0 (TID 336, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:665)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:664)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:664)
I could not find any reference to the above error. Something about the data format seems to go wrong when the data is reshaped to ['label', 'features'], so I am unable to proceed.
Also, how can I get 5 outputs in MLPC instead of 26? Any pointers?