PySpark: 'ResultIterable' object has no attribute 'request_tm'

Date: 2017-12-01 10:22:32

Tags: pyspark

I am using PySpark to process some data. The data looks like this:

8611060350280948828b33be803 4363    2017-10-01
8611060350280948828b33be803 4363    2017-10-02
4e5556e536714363b195eb8f88becbf8    365 2017-10-01
4e5556e536714363b195eb8f88becbf8    365 2017-10-02
4e5556e536714363b195eb8f88becbf8    365 2017-10-03
4e5556e536714363b195eb8f88becbf8    365 2017-10-04

I created a class to store these records. The code is as follows:

class LogInfo:
    def __init__(self, session_id, sku_id, request_tm):
        self.session_id = session_id
        self.sku_id = sku_id
        self.request_tm = request_tm

The transformation code is as follows:

from classFile import LogInfo
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
orgData = sc.textFile(<dataPath>)
readyData = orgData.map(lambda x: x.split('\t')).\
     filter(lambda x: x[0].strip() != "" and x[1].strip() != "" and x[2].strip() != "").\
     map(lambda x: LogInfo(x[0], x[1], x[2])).groupBy(lambda x: x.session_id).\
     filter(lambda x: len(x[1]) > 3).filter(lambda x: len(x[1]) < 20).\
     map(lambda x: x[1]).sortBy(lambda x:x.request_tm).map(lambda x: x.sku_id)

But this code does not work. The error message is as follows:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 346, in func
    return f(iterator)
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2053, in <lambda>
    return self.map(lambda x: (f(x), x))
  File "D:<filePath>", line 15, in <lambda>
    map(lambda x: x[1]).sortBy(lambda x:x.request_tm).map(lambda x: x.sku_id)
AttributeError: 'ResultIterable' object has no attribute 'request_tm'
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

........

I think the main error message is shown above, but I can't figure out where I went wrong. Could someone help? Thanks a lot!

1 Answer:

Answer 0 (score: 0)

I think you need to replace this:

map(lambda x: x[1])

with this:

flatMap(lambda x: list(x[1]))

Basically, after the groupBy, x[1] is a ResultIterable object, so if you want to sort its individual elements you first need to flatten it.
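To illustrate, here is a minimal sketch (it assumes a local SparkContext named sc, as in the question; the sample pairs are invented for illustration):

pairs = sc.parallelize([("s1", 1), ("s1", 2), ("s2", 3)]).groupBy(lambda t: t[0])
key, values = pairs.first()
print(type(values))  # <class 'pyspark.resultiterable.ResultIterable'>
print(list(values))  # the grouped elements, e.g. [('s1', 1), ('s1', 2)]

With map(lambda x: x[1]) the RDD still holds one ResultIterable per key, so the later sortBy(lambda x: x.request_tm) receives whole groups rather than LogInfo objects; flatMap(lambda x: list(x[1])) emits the individual elements instead, which is why the AttributeError goes away.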

Edit: if you need a list of sku_ids per session in the RDD, then:

.map(lambda x: [log.sku_id for log in sorted(x[1], key=lambda log: log.request_tm)])
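Putting it together, the corrected pipeline might look like the sketch below (untested against the asker's data; it keeps the question's LogInfo class, the <dataPath> placeholder, and the same session-length filters):

from classFile import LogInfo
from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)

readyData = (
    sc.textFile(<dataPath>)                       # <dataPath>: the question's placeholder
    .map(lambda line: line.split('\t'))
    .filter(lambda f: f[0].strip() != "" and f[1].strip() != "" and f[2].strip() != "")
    .map(lambda f: LogInfo(f[0], f[1], f[2]))
    .groupBy(lambda log: log.session_id)          # -> (session_id, ResultIterable of LogInfo)
    .filter(lambda kv: 3 < len(kv[1]) < 20)       # same bounds as the question's two filters
    .map(lambda kv: [log.sku_id
                     for log in sorted(kv[1], key=lambda log: log.request_tm)])
)

Each record of readyData is then a plain Python list of sku_ids for one session, ordered by request_tm.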