Question

I am using df.randomSplit(), but it does not split the DataFrame into parts with equal numbers of rows. Is there another way to achieve this?

Answer 1

In my case, I needed balanced (equal-sized) partitions in order to run a particular cross-validation experiment.

To do that, you would typically:

Randomize the dataset
Apply a modulo operation to assign each element to a fold (partition)
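
The two steps above can be sketched on a plain local collection, without Spark. This is only a minimal illustration: `FoldSketch`, `assignFolds`, `k`, and `seed` are hypothetical names chosen for the example, not part of any library. Because the fold is the shuffled position modulo `k`, fold sizes can differ by at most one element.

```scala
import scala.util.Random

object FoldSketch {
  // Hypothetical helper: assigns each element to one of k folds after a seeded shuffle.
  def assignFolds[A](data: Seq[A], k: Int, seed: Long): Seq[(A, Int)] =
    new Random(seed)
      .shuffle(data)                       // step 1: randomize the order
      .zipWithIndex                        // attach positions 0, 1, 2, ...
      .map { case (x, i) => (x, i % k) }   // step 2: position modulo k is the fold

  def main(args: Array[String]): Unit = {
    val folds = assignFolds(1 to 10, k = 3, seed = 42L)
    // Count how many elements landed in each fold.
    val sizes = folds.groupBy(_._2).map { case (fold, xs) => fold -> xs.size }
    sizes.toSeq.sortBy(_._1).foreach { case (fold, n) => println(s"fold $fold: $n elements") }
  }
}
```

With 10 elements and 3 folds, the folds hold 4, 3, and 3 elements regardless of the seed; only which elements land in which fold changes.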

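After this step, you have to extract each partition with filter, because there is still no transformation that splits a single RDD into several.

Here is some Scala code that uses only standard Spark operations, so it should be easy to adapt to Python: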

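// Assumptions: `rdd` is a placeholder for your input RDD (the original snippet omits its name),
// `seed` is any Long, and `m_classIndex` is the index of the class attribute in each instance.
val npartitions = 3

val foldedRDD = rdd
   // Pair each instance with its index and a pseudo-random number derived from that index
   .zipWithIndex
   .map ( t => (t._1, t._2, new scala.util.Random(t._2*seed).nextInt()) )
   // Sort by class value, then by the random number: random order within each class
   .sortBy( t => (t._1(m_classIndex), t._3) )
   // Assign each element to a fold: new position modulo the number of folds
   // (t._1 is still the (instance, index, random) triple at this point)
   .zipWithIndex
   .map( t => (t._1, t._2 % npartitions) )

// One filter per fold, yielding npartitions RDDs of (nearly) equal size
val balancedRDDList =
    for (f <- 0 until npartitions)
    yield foldedRDD.filter( _._2 == f )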
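
Since the question is about a DataFrame rather than an RDD, the same idea can probably be expressed with a window function. The following is a sketch under stated assumptions, not a definitive implementation: `EqualSplit`, `splitEqually`, and the temporary column name `fold` are hypothetical, and the input is assumed not to already contain a column called `fold`. Note that a window without `partitionBy` moves all rows into a single partition, so this only suits data that fits comfortably on one executor.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

object EqualSplit {
  // Hypothetical helper: splits df into n parts whose row counts differ by at most one.
  def splitEqually(df: DataFrame, n: Int, seed: Long): Seq[DataFrame] = {
    // Ordering by a seeded random value shuffles the rows.
    // No partitionBy here, so Spark gathers every row into one partition.
    val shuffled = Window.orderBy(rand(seed))
    // row_number starts at 1, so subtract 1 before taking the modulo.
    val withFold = df
      .withColumn("fold", (row_number().over(shuffled) - 1) % n)
      .cache() // intended to let the filters below reuse one fold assignment
    // One filter per fold, dropping the helper column afterwards.
    (0 until n).map(f => withFold.filter(col("fold") === f).drop("fold"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("equal-split-sketch")
      .getOrCreate()
    val df = spark.range(10).toDF("id")
    val parts = splitEqually(df, n = 3, seed = 42L)
    parts.zipWithIndex.foreach { case (p, i) => println(s"part $i: ${p.count()} rows") }
    spark.stop()
  }
}
```

Unlike df.randomSplit(), which samples rows according to weights and therefore gives only approximately proportional sizes, this assigns rows by position, so the part sizes are exact up to a remainder of one row.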
