PySpark MLP Neural Network

Date: 2018-12-24 11:40:25

Tags: python-2.7 apache-spark pyspark neural-network apache-spark-ml

PySpark version 2.4.0

I am trying to reduce the number of nodes in the output layer. Unfortunately, I have not been able to get this to work with PySpark's MultilayerPerceptronClassifier (MLPC).

I am using the letter recognition dataset. Link: https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/letterdata.csv

>>> df.show(1)
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
|letter|xbox|ybox|width|height|onpix|xbar|ybar|x2bar|y2bar|xybar|x2ybar|xy2bar|xedge|xedgey|yedge|yedgex|
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
|     T|   2|   8|    3|     5|    1|   8|  13|    0|    6|    6|    10|     8|    0|     8|    0|     8|
+------+----+----+-----+------+-----+----+----+-----+-----+-----+------+------+-----+------+-----+------+
only showing top 1 row

I have already converted the "letter" column to an integer using StringIndexer (new column "letter_id").

I have two options. Either use letter_id as the label, which gives an output layer with 26 nodes:

>>> indexed_df.columns[1:-1]
['xbox', 'ybox', 'width', 'height', 'onpix', 'xbar', 'ybar', 'x2bar', 'y2bar', 'xybar', 'x2ybar', 'xy2bar', 'xedge', 'xedgey', 'yedge', 'yedgex', 'letter_id']

Or create a binary-encoded column as the label, which would need only 5 output nodes (2^5 = 32 >= 26 letters):

>>> indexed_df.select('letter', 'letter_id', 'binary').distinct().show()
+------+---------+------+                                                       
|letter|letter_id|binary|
+------+---------+------+
|     D|        1| 00001|

Here is the complete code:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import *


df = spark.read.csv('letterdata.csv', header=True, inferSchema=True)

indexer = StringIndexer(inputCol='letter', outputCol='letter_id')
indexed_df = indexer.fit(df).transform(df)
indexed_df = indexed_df.withColumn('letter_id', indexed_df['letter_id'].cast('int'))
indexed_df.select('letter', 'letter_id').distinct().show()

# encode letter_id (0-25) as a 5-bit binary string, e.g. 1 -> '00001'
udf_binary = udf(lambda x: '{0:05b}'.format(x))

indexed_df = indexed_df.withColumn('binary', udf_binary(indexed_df['letter_id']))

indexed_df.select('letter', 'letter_id', 'binary').distinct().show()

final_cols = indexed_df.columns[1:-1]
#final_cols = indexed_df.columns[1:-2] + ['binary']

dataset = indexed_df.select(final_cols)

parser = VectorAssembler(inputCols=final_cols[:-1], outputCol="features")
dataset = parser.transform(dataset)
final = dataset.select(col('letter_id').alias('label'), col('features'))
#final = dataset.select(col('binary').alias('label'), col('features'))

splits = final.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]
layers = [17, 30, 20, 26]  # input layer, two hidden layers, 26-node output layer

trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
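
For reference, the MLPC docs say layers[0] must equal the size of the assembled feature vectors and layers[-1] must equal the number of distinct labels. A minimal sketch for checking both sizes in this pipeline:

print(len(final_cols[:-1]))                       # columns handed to VectorAssembler; should match layers[0]
print(final.select('label').distinct().count())   # distinct classes; should match layers[-1]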

When trying to fit the model,

model = trainer.fit(train)

the following error occurs:

2018-12-24 11:17:27 WARN  BlockManager:66 - Putting block rdd_44_0 failed due to exception java.lang.ArrayIndexOutOfBoundsException.
2018-12-24 11:17:27 WARN  BlockManager:66 - Block rdd_44_0 could not be removed as it was not found on disk or in memory
2018-12-24 11:17:27 ERROR Executor:91 - Exception in task 0.0 in stage 26.0 (TID 336)
java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:665)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:664)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:664)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:660)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
        at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:298)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:121)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-12-24 11:17:27 WARN  TaskSetManager:66 - Lost task 0.0 in stage 26.0 (TID 336, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:665)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:664)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.ml.ann.DataStacker$$anonfun$5$$anonfun$apply$3.apply(Layer.scala:664)

I could not find any reference to the above error. Something about the data format seems to go wrong when it is converted to ['label', 'features'], so I cannot proceed.

Also, how can I get 5 outputs instead of 26 in MLPC? Any pointers?
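
From the API docs, MLPC is a single-label multiclass classifier: the output layer applies softmax over one node per class, and the label column must hold a single numeric class index, so the 5-character binary string cannot be fed in as a label directly. The closest workaround I can think of is training one binary MLPC per bit of the code; a rough, untested sketch (it reuses indexed_df and parser from the script above, and the variable names are my own):

from pyspark.sql.types import DoubleType

assembled = parser.transform(indexed_df)  # keeps the 'binary' column alongside 'features'
bit_models = []
for i in range(5):
    # use the i-th character of the 5-bit string as a 0/1 label
    nth_bit = udf(lambda b, i=i: float(b[i]), DoubleType())
    per_bit = assembled.select(nth_bit(col('binary')).alias('label'), 'features')
    trainer_i = MultilayerPerceptronClassifier(maxIter=100, layers=[16, 30, 20, 2], blockSize=128, seed=1234)
    bit_models.append(trainer_i.fit(per_bit))

At prediction time the five 0/1 outputs would then be concatenated and mapped back to a letter_id.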

0 Answers:

No answers