I am using PySpark to process data. The data looks like this:
8611060350280948828b33be803 4363 2017-10-01
8611060350280948828b33be803 4363 2017-10-02
4e5556e536714363b195eb8f88becbf8 365 2017-10-01
4e5556e536714363b195eb8f88becbf8 365 2017-10-02
4e5556e536714363b195eb8f88becbf8 365 2017-10-03
4e5556e536714363b195eb8f88becbf8 365 2017-10-04
I created a class to hold these records. The code is as follows:
class LogInfo:
    def __init__(self, session_id, sku_id, request_tm):
        self.session_id = session_id
        self.sku_id = sku_id
        self.request_tm = request_tm
The processing code is as follows:
from classFile import LogInfo
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
orgData = sc.textFile(<dataPath>)
readyData = orgData.map(lambda x: x.split('\t')).\
filter(lambda x: x[0].strip() != "" and x[1].strip() != "" and x[2].strip() != "").\
map(lambda x: LogInfo(x[0], x[1], x[2])).groupBy(lambda x: x.session_id).\
filter(lambda x: len(x[1]) > 3).filter(lambda x: len(x[1]) < 20).\
map(lambda x: x[1]).sortBy(lambda x:x.request_tm).map(lambda x: x.sku_id)
But this code does not work. The error message is:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-
hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-
hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-
hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 346, in func
return f(iterator)
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2053, in <lambda>
return self.map(lambda x: (f(x), x))
File "D:<filePath>", line 15, in <lambda>
map(lambda x: x[1]).sortBy(lambda x:x.request_tm).map(lambda x: x.sku_id)
AttributeError: 'ResultIterable' object has no attribute 'request_tm'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
[Stage 1:> (0 + 5) / 10]17/12/01 17:54:15 WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 13, localhost, executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2423, in pipeline_func
return func(split, prev_func(split, iterator))
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 346, in func
return f(iterator)
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <lambda>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 1041, in <genexpr>
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "D:\spark-2.2.0-bin-hadoop2.7\spark-2.2.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2053, in <lambda>
return self.map(lambda x: (f(x), x))
File "D:<filePath>", line 15, in <lambda>
map(lambda x: x[1]).sortBy(lambda x:x.request_tm).map(lambda x: x.sku_id)
AttributeError: 'ResultIterable' object has no attribute 'request_tm'
........
I think the key error message is the one above, but I cannot figure out where my mistake is. Can anyone help? Thank you very much!
Answer (score 0):
I think you need to replace this:
map(lambda x: x[1])
with this:
flatMap(lambda x: list(x[1]))
Basically, after the groupBy, x[1] is a ResultIterable object, so if you want to sort each of its elements you first need to flatten it.
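For context, here is a minimal sketch of how the whole pipeline could look with that change. It assumes the sc, orgData and LogInfo from your question are already defined, merges your two length filters into one, and introduces a grouped variable purely for readability, so treat it as a sketch rather than tested code:

# Same parsing and grouping as in the question, up to the session-length filter.
grouped = (orgData.map(lambda x: x.split('\t'))
           .filter(lambda x: x[0].strip() != "" and x[1].strip() != "" and x[2].strip() != "")
           .map(lambda x: LogInfo(x[0], x[1], x[2]))
           .groupBy(lambda x: x.session_id)
           .filter(lambda x: 3 < len(x[1]) < 20))

# flatMap unpacks each (session_id, ResultIterable) pair back into individual
# LogInfo records, so the later sortBy and map see LogInfo objects.
readyData = (grouped.flatMap(lambda x: list(x[1]))
             .sortBy(lambda x: x.request_tm)
             .map(lambda x: x.sku_id))

Note that readyData is then one flat stream of sku_ids sorted by request_tm across all sessions; the session boundaries are gone.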
Edit: if you need the list of sku_ids per session in the RDD, then:
.map(lambda x: [y.sku_id for y in sorted(list(x[1]), key=lambda y: y.request_tm)])
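Reusing the grouped RDD from the sketch above, the per-session version would be roughly this (skuLists is just an illustrative name):

# One list of sku_ids per session, ordered by request_tm within that session.
skuLists = grouped.map(lambda x: [y.sku_id for y in sorted(x[1], key=lambda y: y.request_tm)])

With your sample rows, only session 4e5556e536714363b195eb8f88becbf8 has more than 3 records, so skuLists.collect() should return something like [['365', '365', '365', '365']].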