Split a dataframe based on the number of nodes (pyspark)

Date: 2019-07-25 10:13:50

Tags: python-3.x pyspark nodes cluster-computing

I am trying to split a dataframe based on the number of nodes (of the cluster).

My dataframe looks like:

originalDF

If I have node = 2 and dataframe.count = 7:

desiredDF

So, applying an iterative approach, the result of the split would be:

Node 1 -> 2 lines, Node 2 -> 2 lines, Node 1 -> 1 line (1 is the result of the modulo)

My question is: how can I do this?

2 Answers:

Answer 0 (score: 0)

You can do this with one of the RDD partitioning functions (see the code below), but I would not recommend it unless you fully understand what you are doing and why you are doing it. In general (and better for most use cases), it is best to let Spark handle the data distribution.

import pyspark.sql.functions as F
import itertools
import math

#creating a random dataframe
l = [(x,x+2) for x in range(1009)]

columns = ['one', 'two']

df = spark.createDataFrame(l, columns)

#coalesce to one partition so a partition key can be assigned to each row
df = df.coalesce(1)

#number of nodes (==partitions)
pCount = 5

#creating a list of partition keys
#basically it repeats range(5) several times until we have enough keys for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))

#now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc = lambda x: partitionKey.pop()).toDF()

#This shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()

Output:

+------------+-----+ 
|partition_id|count| 
+------------+-----+ 
|           1|  202| 
|           3|  202| 
|           4|  202| 
|           2|  202| 
|           0|  201| 
+------------+-----+
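
The uneven counts are just the modulo remainder from the question: 1009 = 5 x 201 + 4, so four partitions end up with 202 rows and one with 201. For comparison, the following is a hedged sketch that stays in the DataFrame API instead of dropping to the RDD: it derives a node key with row_number modulo the node count and repartitions on that key. The row_id and node_id column names are introduced here purely for illustration, and because repartition hash-partitions the key, partition ids are not guaranteed to map one-to-one onto node ids.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

#same example data as above
df = spark.createDataFrame([(x, x + 2) for x in range(1009)], ["one", "two"])

#number of nodes (==partitions)
pCount = 5

#number every row, then assign it to a node round-robin via modulo
w = Window.orderBy("one")
df = (df.withColumn("row_id", F.row_number().over(w) - 1)
        .withColumn("node_id", F.col("row_id") % pCount))

#repartition on the derived key; note this is hash-based, so several
#node_id values can land in the same partition
df = df.repartition(pCount, "node_id")

df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()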

Answer 1 (score: 0)

Error text:

19/07/26 16:38:08 WARN TaskSetManager: Lost task 0.0 in stage 10.0 (TID 10, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 377, in main
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1771, in add_shuffle_key
    for k, v in iterator:
ValueError: not enough values to unpack (expected 2, got 1)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

19/07/26 16:38:08 ERROR TaskSetManager: Task 0 in stage 10.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users//Desktop/TestSpark/Test.py", line 29, in <module>
    SplitDataFrame(dataFrame, nbre_node)
  File "C:\Users\Desktop\TestSpark\WriteDistributed.py", line 114, in SplitDataFrame
    dataFrame = sqlCtx.createDataFrame(dataFrame).show()
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\context.py", line 307, in createDataFrame
    return self.sparkSession.createDataFrame(data, schema, sampleRatio, verifySchema)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\session.py", line 746, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\session.py", line 390, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\session.py", line 361, in _inferSchema
    first = rdd.first()
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1378, in first
    rs = self.take(1)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1360, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\context.py", line 1069, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 10, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 377, in main
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1771, in add_shuffle_key
    for k, v in iterator:
ValueError: not enough values to unpack (expected 2, got 1)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 377, in main
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "C:\Users\nbenmahm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1771, in add_shuffle_key
    for k, v in iterator:
ValueError: not enough values to unpack (expected 2, got 1)

at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more

Process finished with exit code 1
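
The ValueError in this trace comes from add_shuffle_key, which iterates the RDD as "for k, v in iterator:"; in other words, rdd.partitionBy expects an RDD of (key, value) pairs. The two-column rows in answer 0 happen to unpack into exactly two values, but rows with a different number of columns do not. A hedged sketch of the usual workaround, assuming partitionKey and pCount are built exactly as in answer 0, is to key each row explicitly before partitioning and drop the key afterwards:

#key each Row explicitly so partitionBy always sees (key, value) pairs,
#regardless of how many columns the dataframe has
keyed = df.rdd.map(lambda row: (0, row))  # (dummy key, original Row)

#same driver-side key assignment as in answer 0
keyed = keyed.partitionBy(pCount, partitionFunc=lambda x: partitionKey.pop())

#strip the dummy key again and go back to a dataframe
df = keyed.map(lambda kv: kv[1]).toDF()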