How do I read an RDD after calling the map function?

Date: 2019-05-16 17:07:10

Tags: python-3.x pyspark apache-spark-2.0

I am trying to read an RDD after splitting it with the map function. Please find the code below:


def split_(line):
    values = line.split(",")
    return values
edges = edges.rdd.map(split_)
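
For context, here is a minimal self-contained sketch (made-up data and a local SparkSession; the setup names are assumptions, not my real job) of what I expect split_ plus map to produce when the RDD elements are plain comma-separated strings:

    from pyspark.sql import SparkSession

    # Hypothetical local setup, only to illustrate the expected behaviour
    spark = SparkSession.builder.master("local[*]").appName("split-demo").getOrCreate()
    sc = spark.sparkContext

    # Elements here are plain Python strings, so .split(",") is available
    demo_rdd = sc.parallelize(["1,2", "3,4"])
    print(demo_rdd.map(split_).collect())   # expected: [['1', '2'], ['3', '4']]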

Now I am trying to read the output:

all_edges = edges.collect()
print(all_edges)

I expected a list like [1, 2], but I got an error instead.

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-27-c21ec2fd3595> in <module>
----> 1 all_edges = edges.collect()
      2 print(all_edges)

But when I try to read the RDD, I get the following error:

  • Error output:


    Py4JJavaError                             Traceback (most recent call last)
     in <module>
    ----> 1 edge.collect()

    C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\pyspark\rdd.py in collect(self)
        814         """
        815         with SCCallSiteSync(self.context) as css:
    --> 816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
        817         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
        818

    C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
       1255         answer = self.gateway_client.send_command(command)
       1256         return_value = get_return_value(
    -> 1257             answer, self.gateway_client, self.target_id, self.name)
       1258
       1259         for temp_arg in temp_args:

    C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()

    C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
        326                 raise Py4JJavaError(
        327                     "An error occurred while calling {0}{1}{2}.\n".
    --> 328                     format(target_id, ".", name), value)
        329             else:
        330                 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1527, in __getattr__
        idx = self.__fields__.index(item)
    ValueError: 'split' is not in list

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", in main
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 393, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\util.py", line 99, in wrapper
        return f(*args, **kwargs)
      File "", line 2, in split_
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1532, in __getattr__
        raise AttributeError(item)
    AttributeError: split

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1527, in __getattr__
        idx = self.__fields__.index(item)
    ValueError: 'split' is not in list

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", in main
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 393, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\util.py", line 99, in wrapper
        return f(*args, **kwargs)
      File "", line 2, in split_
      File "C:\opt\spark\spark-2.4.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1532, in __getattr__
        raise AttributeError(item)
    AttributeError: split

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
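
Note: the final AttributeError: split (raised from Row.__getattr__ in pyspark/sql/types.py) suggests that the elements reaching split_ are Row objects rather than plain strings, i.e. edges looks like a DataFrame, and a Row has no split method. Below is only a hedged sketch of one way to express the intent, assuming edges still refers to the original DataFrame (before the reassignment above) and that the comma-separated text sits in its first column; the column position is an assumption, not taken from my actual schema:

    # Pull the string value out of the Row before splitting (column index assumed)
    split_edges = edges.rdd.map(lambda row: row[0].split(","))
    print(split_edges.collect())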
    

0 Answers:

No answers yet.