Count not working on RDD - Python, Pyspark

Time: 2018-10-17 22:02:26

Tags: python apache-spark pyspark mapreduce rdd

Question: I am currently trying to read a text file containing JSON data. The goal is to count the number of distinct users by the userID present in the JSON. The problem: calling count() on the RDD in PySpark throws the error below.

  zeroValue provided to each partition is unique from the one provided
  to the final reduce call

Code:

Step 1: I read the file into an RDD named rdd_step1 using:

rdd_step1 = sc.textFile('filepath')
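For context, a minimal self-contained sketch of the setup this step assumes (the SparkContext creation is my assumption and is not shown in the post; 'filepath' is the placeholder used above, not a real path):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()        # assumption: a local Spark context
rdd_step1 = sc.textFile('filepath')    # 'filepath' is a placeholder
print(rdd_step1.take(1))               # peek at one raw JSON line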

Step 2: I created a function that reads each line and returns a JSON object for those lines in the dataset whose JSON is valid.

import json

def safe_parse(raw_json):
    # Return the parsed JSON object only if the line is valid JSON and
    # has a 'created_at' field; otherwise return None.
    try:
        jo = json.loads(raw_json)
        if jo.get('created_at'):
            return jo
        else:
            return None
    except:
        return None
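As a quick sanity check (my own example, not from the post), safe_parse should behave like this on a valid line, a malformed line, and a line without created_at:

valid_line = '{"created_at": "Tue Dec 15 18:57:57 +0000 2015", "user": {"id_str": "4494956854"}}'
print(safe_parse(valid_line)['user']['id_str'])   # -> 4494956854
print(safe_parse('not json at all'))              # -> None (parse error is swallowed)
print(safe_parse('{"id": "123213"}'))             # -> None (no created_at field)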

Step 3: I create an intermediate RDD in which I parse the strings into JSON objects and build a (key, value) RDD:

rdd_step2 = rdd_step1.map(lambda x: safe_parse(x)).filter(lambda x: x is not None).map(lambda x: (x['user']['id_str'], 1))
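Illustratively (hypothetical in-memory data, not the real file), this step turns each valid line into a (user id_str, 1) pair and drops everything else:

sample_line = '{"created_at": "Tue Dec 15 18:57:57 +0000 2015", "user": {"id_str": "4494956854"}}'
pairs = sc.parallelize([sample_line, 'broken line']) \
          .map(safe_parse) \
          .filter(lambda x: x is not None) \
          .map(lambda x: (x['user']['id_str'], 1))
print(pairs.collect())   # -> [('4494956854', 1)]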

Step 4: Use reduceByKey and get the count:

counts = rdd_step2.reduceByKey(lambda a, b: a + b)
Count = counts.count()
print(Count)
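For what it is worth, a tiny in-memory check (made-up keys, not the real data) of what this step computes: reduceByKey followed by count() gives the number of distinct keys, i.e. the number of distinct users:

demo = sc.parallelize([('u1', 1), ('u2', 1), ('u1', 1)])
print(demo.reduceByKey(lambda a, b: a + b).count())   # -> 2 distinct keys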

Sample data:

{"TS": "Mon Jan 01 00:00:00", "id": "123213", "text": "Apparently you are not allowed to use anti", "user": {"id": 4494956854, "id_str": "4494956854", "name": "Amy Smalley", "screen_name": "amyjosmalley", "location": null, "url": null, "description": "Full time mom", "protected": false, "verified": false, "followers_count": 11, "friends_count": 40, "listed_count": 0, "favourites_count": 116, "statuses_count": 23, "created_at": "Tue Dec 15 18:57:57 +0000 2015"}}

Full error message:

  

Py4JJavaError                             Traceback (most recent call last)
in ()
      4 #raise NotImplementedError()
      5 counts = rdd_step2.reduceByKey(lambda a, b: a + b)
----> 6 Count = counts.count()
      7 print(Count)
      8 #my_output.append("num-unique-users", users_count)

     

C:\Users\srikanth\Downloads\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7\python\pyspark\rdd.py in count(self)
   1071         3
   1072         """
-> 1073         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1074
   1075     def stats(self):

     

C:\Users\srikanth\Downloads\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7\python\pyspark\rdd.py in sum(self)
   1062         6.0
   1063         """
-> 1064         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
   1065
   1066     def count(self):

     

C:\Users\srikanth\Downloads\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7\python\pyspark\rdd.py in fold(self, zeroValue, op)
    933         # zeroValue provided to each partition is unique from the one provided
    934         # to the final reduce call
--> 935         vals = self.mapPartitions(func).collect()
    936         return reduce(op, vals, zeroValue)
    937

     

C:\Users\srikanth\Downloads\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7\python\pyspark\rdd.py in collect(self)
    832         """
    833         with SCCallSiteSync(self.context) as css:
--> 834             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    835         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    836

     

~\Downloads\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258
   1259         for temp_arg in temp_args:

     

~\Downloads\spark-2.3.1-bin-hadoop2.7\spark-2.3.1-bin-hadoop2.7\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

     

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 417.0 failed 1 times, most recent failure: Lost task 1.0 in stage 417.0 (TID 462, localhost, executor driver): java.io.FileNotFoundException: C:\Users\srikanth\AppData\Local\Temp\blockmgr-f7a58081-572f-4ea9-a53e-a84f5ebfb955\0c\temp_shuffle_36ab37dd-0f30-41c6-8f4f-ef1400d58ba8 (The system cannot find the path specified)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

     

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.io.FileNotFoundException: C:\Users\srikanth\AppData\Local\Temp\blockmgr-f7a58081-572f-4ea9-a53e-a84f5ebfb955\0c\temp_shuffle_36ab37dd-0f30-41c6-8f4f-ef1400d58ba8 (The system cannot find the path specified)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    ... 1 more

0 answers:

There are no answers yet.