I am trying to split a dataframe according to the number of nodes (in the cluster).
My dataframe looks like:
If I have node = 2 and dataframe.count = 7:
Then, to apply an iterative approach, the result of the split would be:
My question is: how can I do this?
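To make the intent concrete, here is a minimal sketch of one way such a split can be expressed. It is only an illustration of the question, not a proposed answer: the column name value, the variable node_count, and the row-index trick are all assumptions, since the example dataframe and the expected split are not shown above.

# Illustrative sketch only: split a DataFrame into `node_count` roughly equal
# chunks so they can be processed iteratively. All names here are assumptions.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(i,) for i in range(7)], ["value"])

node_count = 2

# Attach a stable row index, then derive a chunk id in [0, node_count).
indexed = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],)) \
               .toDF(df.columns + ["row_idx"])
chunked = indexed.withColumn("chunk", F.col("row_idx") % node_count)

# Iterate over the chunks one by one.
for chunk_id in range(node_count):
    part = chunked.filter(F.col("chunk") == chunk_id).drop("row_idx", "chunk")
    part.show()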
Answer 0 (score: 0)
You can do this with one of the RDD partitioning functions (see the code below), but I would not recommend it unless you fully understand what you are doing and why you are doing it. In general (and better for most use cases), it is best to let Spark handle the data distribution.
import pyspark.sql.functions as F
import itertools
import math

# create a random dataframe
l = [(x, x + 2) for x in range(1009)]
columns = ['one', 'two']
df = spark.createDataFrame(l, columns)

# collapse to one partition so we can assign a partition key
df = df.coalesce(1)

# number of nodes (== partitions)
pCount = 5

# create a list of partition keys
# basically it repeats range(pCount) enough times so that we have one key for each row
partitionKey = list(itertools.chain.from_iterable(itertools.repeat(x, math.ceil(df.count()/pCount)) for x in range(pCount)))

# now we can distribute the data to the partitions
df = df.rdd.partitionBy(pCount, partitionFunc=lambda x: partitionKey.pop()).toDF()

# this shows us the number of records within each partition
df.withColumn("partition_id", F.spark_partition_id()).groupBy("partition_id").count().show()
Output:
+------------+-----+
|partition_id|count|
+------------+-----+
| 1| 202|
| 3| 202|
| 4| 202|
| 2| 202|
| 0| 201|
+------------+-----+
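As noted above, the usually preferable alternative is to let Spark manage the distribution itself. A minimal sketch of that approach (not part of the original answer) simply repartitions the dataframe into pCount partitions, without any hand-built key list:

# Sketch of the Spark-managed alternative mentioned above (not from the
# original answer): let repartition() distribute the rows across pCount
# partitions instead of assigning partition keys by hand.
import pyspark.sql.functions as F

df = spark.createDataFrame([(x, x + 2) for x in range(1009)], ['one', 'two'])
pCount = 5

df = df.repartition(pCount)

# Row counts per partition are roughly balanced, but the exact split is up to Spark.
df.withColumn("partition_id", F.spark_partition_id()) \
  .groupBy("partition_id").count().show()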
Answer 1 (score: 0)
Error text:
19/07/26 16:38:08 WARN TaskSetManager: Lost task 0.0 in stage 10.0 (TID 10, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 377, in main
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1771, in add_shuffle_key
    for k, v in iterator:
ValueError: not enough values to unpack (expected 2, got 1)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/07/26 16:38:08 ERROR TaskSetManager: Task 0 in stage 10.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "C:/Users//Desktop/TestSpark/Test.py", line 29, in <module>
    SplitDataFrame(dataFrame, nbre_node)
  File "C:\Users\Desktop\TestSpark\WriteDistributed.py", line 114, in SplitDataFrame
    dataFrame = sqlCtx.createDataFrame(dataFrame).show()
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\context.py", line 307, in createDataFrame
    return self.sparkSession.createDataFrame(data, schema, sampleRatio, verifySchema)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\session.py", line 746, in createDataFrame
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\session.py", line 390, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\session.py", line 361, in _inferSchema
    first = rdd.first()
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1378, in first
    rs = self.take(1)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1360, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\context.py", line 1069, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\AppData\Local\Programs\Python\Python37-32\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 1 times, most recent failure: Lost task 0.0 in stage 10.0 (TID 10, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 377, in main
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "C:\Users.\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1771, in add_shuffle_key
    for k, v in iterator:
ValueError: not enough values to unpack (expected 2, got 1)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 377, in main
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 372, in process
  File "C:\spark\spark-2.4.3-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "C:\Users\nbenmahm\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\rdd.py", line 1771, in add_shuffle_key
    for k, v in iterator:
ValueError: not enough values to unpack (expected 2, got 1)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Process finished with exit code 1
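The ValueError raised inside add_shuffle_key ("for k, v in iterator") is the usual symptom of calling rdd.partitionBy on an RDD whose elements are not (key, value) pairs, since the partitioner tries to unpack each element into exactly two values. Below is a minimal sketch of the usual workaround, wrapping each row in an explicit (key, row) tuple before partitioning. The names split_dataframe, nbre_node and keys are illustrative assumptions; this is not the asker's SplitDataFrame code.

# Hypothetical sketch, not the original SplitDataFrame: partitionBy needs a
# pair RDD, so wrap each Row in an explicit (key, row) tuple first.
import itertools
import math

def split_dataframe(df, nbre_node):
    count = df.count()
    # One partition key per row, cycling through range(nbre_node).
    keys = list(itertools.chain.from_iterable(
        itertools.repeat(x, math.ceil(count / nbre_node)) for x in range(nbre_node)))

    # (key, Row) pairs: the key decides the target partition.
    paired = df.rdd.zipWithIndex().map(lambda pair: (keys[pair[1]], pair[0]))
    partitioned = paired.partitionBy(nbre_node, partitionFunc=lambda k: k)

    # Drop the key again and rebuild the DataFrame.
    return partitioned.map(lambda kv: kv[1]).toDF(df.columns)

With the rows wrapped as explicit pairs, the "for k, v in iterator" line in add_shuffle_key always receives two values, so the unpacking error above no longer occurs.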