I have a DataFrame (converted to an RDD) and want to repartition it so that each key (the first column) gets its own partition. This is what I did:
# Repartition to # key partitions and map each row to a partition given their key rank
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))
However, when I try to map it back to a DataFrame or save it, I get this error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
for obj in iterator:
File "spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1703, in add_shuffle_key
for k, v in iterator:
ValueError: too many values to unpack
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more
Further testing showed that even this leads to the same error: my_rdd = df.rdd.partitionBy(x)  # x can be 5, 100, etc.
Has anyone run into this? If so, how did you solve it?
Answer 0 (score: 2):
partitionBy requires a PairwiseRDD, which in Python means an RDD of tuples (lists) of length 2, where the first element is the key and the second is the value.

partitionFunc takes that key and maps it to a partition number. When you use it on an RDD[Row], it tries to unpack each row into a key and a value and fails:
from pyspark.sql import Row
row = Row(1, 2, 3)
k, v = row
## Traceback (most recent call last):
## ...
## ValueError: too many values to unpack (expected 2)
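For contrast, here is a minimal sketch (assuming an existing SparkContext named sc, not taken from the question) of partitionBy applied to an RDD that already consists of (key, value) pairs; partitionFunc then receives only the key:

# Minimal sketch, assuming an existing SparkContext `sc`.
pairs = sc.parallelize([(0, "a"), (1, "b"), (2, "c"), (3, "d")])

# partitionFunc receives the key alone, so no unpacking of rows is needed.
partitioned = pairs.partitionBy(2, partitionFunc=lambda key: key % 2)

# glom() collects the elements of each partition into a list for inspection.
print(partitioned.glom().collect())
## e.g. [[(0, 'a'), (2, 'c')], [(1, 'b'), (3, 'd')]]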
Even if you provide correctly keyed data, for example:

my_rdd = (df.rdd
          .map(lambda row: (int(row[0]), row))
          .partitionBy(len(keys)))

it doesn't really make sense: partitioning by key is not particularly meaningful for a DataFrame. For more details, see my answer to How to define partitioning of DataFrame?.
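As an aside, if the underlying goal is simply to co-locate rows that share a key, newer Spark releases (1.6 and later; the question uses 1.5.1) let you repartition the DataFrame by a column directly, without dropping to the RDD API. A hedged sketch, where "key_col" is a hypothetical name standing in for the first column of df:

# Sketch only: requires Spark 1.6+, where repartition() accepts column arguments.
# "key_col" is a hypothetical column name for the first column of df.
repartitioned_df = df.repartition(len(keys), df["key_col"])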