Spark: Length of List Tuple

Date: 2015-07-03 13:51:10

Tags: apache-spark pyspark

What is wrong with my code?

idAndNumbers = ((1,(1,2,3)))
irRDD = sc.parallelize(idAndNumbers)
irLengthRDD = irRDD.map(lambda x:x[1].length).collect()

I get a bunch of errors, such as:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.:org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Edit:

Full traceback:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 88.0 failed 1 times, most recent failure: Lost task 0.0 in stage 88.0 (TID 88, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-79-ef1d5a130db5>", line 12, in <lambda>
TypeError: 'int' object has no attribute '__getitem__'

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Edit 2:

It turns out it really is a nested tuple I am dealing with, something like this: ((1,(1,2,3)))
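
Note that in Python the extra outer parentheses add no nesting: ((1,(1,2,3))) is the very same tuple as (1,(1,2,3)). sc.parallelize therefore iterates over that tuple and distributes two elements, 1 and (1,2,3), and applying x[1] to the int 1 is what raises the TypeError in the traceback. A minimal sketch of that behaviour, assuming a live SparkContext named sc:

idAndNumbers = ((1, (1, 2, 3)))      # identical to (1, (1, 2, 3)); the outer parentheses add nothing
rdd = sc.parallelize(idAndNumbers)   # the RDD holds two elements: 1 and (1, 2, 3)
print rdd.collect()                  # [1, (1, 2, 3)]
# rdd.map(lambda x: x[1]) then evaluates 1[1] on the first element,
# which is the TypeError: 'int' object has no attribute '__getitem__' shown above.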

2 answers:

Answer 0 (score: 0)

>>> ian = [(1,(1,2,3))]
>>> p = sc.parallelize(ian)
>>> l = p.map(lambda x: len(x[1]))
>>> print l.collect()

[3]

You need to use len. Tuples do not have anything called length.
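
Following up on that answer, a minimal sketch (again assuming a live SparkContext named sc) of how the single (id, numbers) record from the question could be handled: wrap it in a list so parallelize treats it as one element, then take len of the second field.

>>> record = (1, (1, 2, 3))                   # the nested tuple from the question
>>> rdd = sc.parallelize([record])            # wrapped in a list: one RDD element
>>> print rdd.map(lambda x: (x[0], len(x[1]))).collect()
[(1, 3)]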

Answer 1 (score: 0)

Agreed with ayan guha; you can type help(len) to see the documentation for len.

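For reference, a Python 2 interpreter (the 'has no attribute __getitem__' wording in the traceback is Python 2) prints roughly the following for help(len); the exact wording differs slightly between Python versions:

>>> help(len)
Help on built-in function len in module __builtin__:

len(...)
    len(object) -> integer

    Return the number of items of a sequence or collection.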