Using loadLabeledPoints with an RDD

Time: 2017-08-10 13:57:39

Tags: parsing apache-spark pyspark rdd libsvm

I am using pyspark.

I read a libsvm file, transpose it, and then save it again.

I store each data row as an MLlib LabeledPoint object with sparse features.

I tried saving it using MLUtils.saveAsLibSVMFile and then reading the file back using MLUtils.loadLibSVMFile, and I got the error shown after the sketch below.
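
Roughly, the flow looks like this (the paths, the example sparse vector, and the transpose step itself are placeholders here, and sc is the pyspark shell's SparkContext):

from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector

# read the original libsvm file as an RDD of LabeledPoints
data = MLUtils.loadLibSVMFile(sc, "input.libsvm")

# transpose step omitted; each transposed row ends up as a LabeledPoint
# with sparse features, e.g. LabeledPoint(1.0, SparseVector(10, [0, 3], [2.0, 0.5]))
transposed = data  # placeholder for the real transposed RDD

# save in libsvm format, then try to read it back
MLUtils.saveAsLibSVMFile(transposed, "transposed_libsvm")
reloaded = MLUtils.loadLibSVMFile(sc, "transposed_libsvm")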


ValueError: could not convert string to float: [


    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
    at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

I read on the MLUtils documentation page that if you want to use loadLabeledPoints, you need to save the data using RDD.saveAsTextFile, but when I did that I got the error shown after the sketch below.
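
What I tried looks roughly like this (transposed stands for my RDD of labeled points, and the path is a placeholder):

from pyspark.mllib.util import MLUtils

# save the labeled points as text, as the MLUtils docs suggest ...
transposed.saveAsTextFile("transposed_points")

# ... then try to load them back as labeled points
reloaded = MLUtils.loadLabeledPoints(sc, "transposed_points")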


17/08/10 16:55:51 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 3, 192.168.1.205, executor 0): org.apache.spark.SparkException: Cannot parse a double from: [
    at org.apache.spark.mllib.util.NumericParser$.parseDouble(NumericParser.scala:120)
    at org.apache.spark.mllib.util.NumericParser$.parseArray(NumericParser.scala:70)
    at org.apache.spark.mllib.util.NumericParser$.parseTuple(NumericParser.scala:91)
    at org.apache.spark.mllib.util.NumericParser$.parse(NumericParser.scala:41)
    at org.apache.spark.mllib.regression.LabeledPoint$.parse(LabeledPoint.scala:62)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledPoints$1.apply(MLUtils.scala:195)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$loadLabeledPoints$1.apply(MLUtils.scala:195)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:121)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:112)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:112)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:112)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:936)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "["
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at org.apache.spark.mllib.util.NumericParser$.parseDouble(NumericParser.scala:117)
    ... 30 more

How can I save an RDD of labeled points in libsvm format, and then load it back from disk using pyspark?

Thanks

1 answer:

Answer 0 (score: 0)

The problem was that the LabeledPoints were not written to the file in libsvm format, which made it hard to read them back.

I solved it by creating the labeled points in memory; before writing them to a file, I convert each one to a libsvm-format string and save those strings as text. After that, I was able to read the file back in libsvm format:

def pointToLibsvmRow(point):
    # assumes point.features is a flat array whose first half holds the
    # feature indices and whose second half holds the matching values
    s = point.features.reshape(2, -1, order="C").transpose().astype("str")
    # libsvm line: "<label> <index>:<value> <index>:<value> ..."
    pairs = [str(int(float(point.label)))] + ["%s:%s" % (str(int(float(a))), b) for a, b in s.tolist()]
    return " ".join(pairs)
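
As a rough usage sketch (labeled_points and the output path here are placeholders, and sc is the pyspark SparkContext), the helper then fits into the write/read round trip like this:

from pyspark.mllib.util import MLUtils

# write each point as one libsvm-formatted text line ...
labeled_points.map(pointToLibsvmRow).saveAsTextFile("points_libsvm")

# ... and read the directory back as LabeledPoints in libsvm format
reloaded = MLUtils.loadLibSVMFile(sc, "points_libsvm")

One thing to watch: standard libsvm files use one-based feature indices, so depending on how the indices stored in point.features were produced, they may need a +1 offset when the row string is built.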