collect() shows an error in some cases but runs perfectly in others

Posted: 2019-05-30 11:11:51

Tags: python apache-spark pyspark

I am working with two different RDDs (a quick sc.parallelize sketch of both follows the list):
1) [(2,3),(3,4),(4,5),(7,8)]
2) [((4,2),(2,1)),((4,2),(-3,4)),((4,2),(6,3)),((2,1),(4,2))]
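
A minimal way to build them, assuming an existing SparkContext named sc (sketch only):

rdd1 = sc.parallelize([(2, 3), (3, 4), (4, 5), (7, 8)])        # RDD of points
rdd2 = sc.parallelize([((4, 2), (2, 1)), ((4, 2), (-3, 4)),
                       ((4, 2), (6, 3)), ((2, 1), (4, 2))])    # RDD of point pairs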
My code is:

test_cartesian=get_cartesian(rdd).map(lambda c: find_slope(c)).groupByKey().filter(lambda t: len(t[1]) >= 2).flatMapValues(lambda x: x).filter(lambda x: x).groupByKey().mapValues(list).collect()  
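
For readability, here is the same chain split into intermediate steps (just a sketch; it uses the helper functions defined below and is otherwise identical to the one-liner above):

pairs = get_cartesian(rdd)                       # every ordered pair of distinct elements
slopes = pairs.map(lambda c: find_slope(c))      # ((point, slope), other_point)
grouped = slopes.groupByKey().filter(lambda t: len(t[1]) >= 2)
test_cartesian = (grouped.flatMapValues(lambda x: x)
                         .filter(lambda x: x)
                         .groupByKey()
                         .mapValues(list)
                         .collect())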

where get_cartesian() and find_slope() are user-defined functions, defined as follows:

def get_cartesian(rdd):
    # all ordered pairs of distinct elements of the RDD
    return rdd.cartesian(rdd).filter(lambda row: row[0] != row[1])


def find_slope(x):
    # x is a pair of points ((x1, y1), (x2, y2))
    if x[0][0] == x[1][0]:
        slope = "inf"
    else:
        slope = (x[1][1] - x[0][1]) / (x[1][0] - x[0][0])
    result = ((x[0], slope), x[1])
    return result
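
For the first RDD, find_slope receives a pair of (x, y) points from get_cartesian and works as expected, e.g. (plain Python, sketch only):

find_slope(((2, 3), (3, 4)))
# returns (((2, 3), 1.0), (3, 4))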

The collect() function returns a result for RDD 1), but for the second RDD it fails with the following error:

Py4JJavaError                             Traceback (most recent call last)
 in <module>()
      1 test_rdd = sc.parallelize([((4,2),(2,1)),((4,2),(-3,4)),((4,2),(6,3)),((2,1),(4,2)),((2,1),(-3,4)),((2,1),(6,3)),((-3,4),(4,2)),((-3,4),(2,1)),((-3,4),(6,3)),((6,3),(4,2)),((6,3),(2,1)),((6,3),(-3,4))])
----> 2 assert isinstance(find_collinear(test_rdd), RDD) == True, "Incorrect return type: function must return an RDD"

 in find_collinear(rdd)
      1 def find_collinear(rdd):
----> 2     test_cartesian = get_cartesian(rdd).map(lambda c: find_slope(c)).groupByKey().filter(lambda t: len(t[1]) >= 2).flatMapValues(lambda x: x).filter(lambda a: a).groupByKey().mapValues(list).collect()
      3     formatted = [format_result(test_cartesian[i]) for i in range(len(test_cartesian))]
      4     import itertools
      5     out = list(itertools.chain(*formatted))

~/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py in collect(self)
    814         """
    815         with SCCallSiteSync(self.context) as css:
--> 816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    818

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 10.0 failed 1 times, most recent failure: Lost task 2.0 in stage 10.0 (TID 94, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 1945, in combineLocally
    merger.mergeValues(iterator)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "", line 2, in <lambda>
  File "", line 7, in find_slope
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 2499, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 352, in func
    return f(iterator)
  File "/home/kriti/anaconda3/lib/python3.6/site-packages/pyspark/rdd.py", line 1945, in combineLocally
    merger.mergeValues(iterator)
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
  File "/home/kriti/Downloads/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "", line 2, in <lambda>
  File "", line 7, in find_slope
TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

Can anyone tell me why this is happening?
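
For what it's worth, I can reproduce the same TypeError outside Spark by calling find_slope directly on two elements of the second RDD (a hypothetical pair, plain Python, using the function defined above):

pair_a = ((4, 2), (2, 1))   # one element of the second RDD
pair_b = ((2, 1), (4, 2))   # another element
find_slope((pair_a, pair_b))
# TypeError: unsupported operand type(s) for -: 'tuple' and 'tuple'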

0 answers:

No answers