PySpark take() returns an error: TypeError: 'int' object is not iterable

Date: 2018-09-20 11:22:07

Tags: apache-spark pyspark

I am trying to learn PySpark. I run the following commands:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
   .master("local") \
   .appName("Linear Regression Model - 2") \
   .config("spark.executor.memory", "1gb") \
   .getOrCreate()

sc = spark.sparkContext

print(sc) 

This works and returns:

SparkContext master=local appName=Linear Regression Model - 2

Then I read a text file and display its contents:

header = sc.textFile('cal_housing.domain')
header.collect()

I got the input file from here: link

This works fine and produces the following output:

['longitude: continuous.', 'latitude: continuous.', 'housingMedianAge: continuous.', 'totalRooms: continuous.', 'totalBedrooms: continuous.', 'population: continuous.', 'households: continuous.', 'medianIncome: continuous.', 'medianHouseValue: continuous.']

As far as I know, we should use take() rather than collect() when the data is large. However, when I use the take() function, it returns an error:

header.take(2)

I am trying to figure out why it returns TypeError: 'int' object is not iterable, but I cannot. As I understand it, header is an RDD object that contains a list of header lines, so it should be iterable, not an int.
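While trying to narrow this down, I noticed the traceback points at cloudpickle's closure reconstruction inside the worker (pyspark/cloudpickle.py), not at the RDD contents themselves, which as far as I can tell is sometimes a symptom of the driver and the workers running mismatched Python versions. A minimal check I could run, assuming such a mismatch is the cause (I have not confirmed this), would be:

import os
import sys

# Interpreter the driver (this notebook/script) is running on.
print("driver python:", sys.version)

# PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON tell Spark which interpreters the
# workers and the driver should use; if unset or pointing at different
# versions, task closures pickled on the driver may fail to load on a worker.
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON:", os.environ.get("PYSPARK_DRIVER_PYTHON"))

# Hypothetical workaround if the versions differ: pin both to the same
# interpreter before the SparkSession is created, e.g.
# os.environ["PYSPARK_PYTHON"] = sys.executable
# os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable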

How can I fix this error?

Error message:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 56, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 784, in _make_skel_func
    closure = _reconstruct_closure(closure) if closure else None
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 776, in _reconstruct_closure
    return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:194)
  at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:235)
  at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:153)
  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1533)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1521)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1520)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1520)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1748)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
  at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:463)
  at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
  at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:282)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:238)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 56, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 784, in _make_skel_func
    closure = _reconstruct_closure(closure) if closure else None
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 776, in _reconstruct_closure
    return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

  at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:194)
  at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:235)
  at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:153)
  at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  ... 1 more

0 Answers:

There are no answers yet.