PySpark: applying reduceByKey to the values of a keyed RDD

Time: 2019-06-19 22:30:56

Tags: apache-spark pyspark rdd reduce

After some transformations, I ended up with an RDD in the following format:

[(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
 (1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]

I can't figure out how to essentially do a "reduceByKey()" on the value part of this RDD.

This is what I want to achieve:

[(0, [('a', 1), ('b', 3)]),
 (1, [('c', 1), ('d', 1), ('h', 2)])]

I initially used .values() and then applied reduceByKey to the result, but I ended up losing the original keys (0 or 1 in this case).

2 answers:

Answer 0: (score: 1)

You lose the original keys because .values() returns only the value of each key-value pair in a row. You should sum the tuples within each row instead.
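Summing the tuples inside each row could be sketched roughly as follows. This is only an illustration, not necessarily the code the answer had in mind: `sum_pairs` and `rows` are hypothetical names, and a plain Python list stands in for the RDD so the snippet runs without a Spark session.

```python
from collections import Counter

def sum_pairs(pairs):
    """Sum the counts of (label, count) tuples that share a label."""
    totals = Counter()
    for label, count in pairs:
        totals[label] += count
    # Sort by label so the output order is deterministic.
    return sorted(totals.items())

# Local stand-in for the RDD contents shown in the question.
rows = [(0, [('a', 1), ('b', 1), ('b', 1), ('b', 1)]),
        (1, [('c', 1), ('d', 1), ('h', 1), ('h', 1)])]

# Local equivalent of applying sum_pairs to each row's value only.
result = [(key, sum_pairs(pairs)) for key, pairs in rows]
print(result)
# [(0, [('a', 1), ('b', 3)]), (1, [('c', 1), ('d', 1), ('h', 2)])]
```

On the actual RDD, `rdd.mapValues(sum_pairs)` should apply the same function to each value while leaving the outer key (0 or 1) untouched.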

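Answer 1: (score: 0)

Although values() does return an RDD, reduceByKey then operates across all the values in that RDD rather than row by row, so the outer keys are not preserved.

You can also use groupby (which requires the input to be sorted) to achieve the same result: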

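from itertools import groupby

# distdata is the keyed RDD from the question; each row's pairs are sorted so equal labels are adjacent.
distdata.map(lambda x: (x[0], [(a, sum(c[1] for c in b))
                               for a, b in groupby(sorted(x[1]), key=lambda p: p[0])])).collect()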
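The sorting requirement comes from itertools.groupby, which only groups consecutive items with equal keys. A small local sketch (no Spark needed; `sum_by_label` and `unsorted_pairs` are hypothetical names used only for illustration) shows what happens when the sort is skipped:

```python
from itertools import groupby

def sum_by_label(pairs):
    """Sum counts per label, assuming equal labels are adjacent."""
    return [(label, sum(count for _, count in group))
            for label, group in groupby(pairs, key=lambda p: p[0])]

unsorted_pairs = [('b', 1), ('a', 1), ('b', 1), ('b', 1)]

# Without sorting, the two runs of 'b' are treated as separate groups.
unsorted_result = sum_by_label(unsorted_pairs)
print(unsorted_result)  # [('b', 1), ('a', 1), ('b', 2)]

# Sorting first makes equal labels adjacent, so each label appears once.
sorted_result = sum_by_label(sorted(unsorted_pairs))
print(sorted_result)    # [('a', 1), ('b', 3)]
```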