Process each partition, and each row within each partition, one at a time

Time: 2019-09-23 14:34:14

Tags: scala apache-spark hadoop hdfs hadoop-partitioning

Question:

I have the following 2 dataframes stored in an array. The data is already partitioned by SECURITY_ID.

How do I process each dataframe separately, and within each dataframe process one row at a time? The two dataframes look like this:

Dataframe 1 (DF1):
+-------------+----------+--------+---------+-----------+--------+
| ACC_SECURITY|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+-------------+----------+--------+---------+-----------+--------+
|9161530335G71|  91615303|1111    |     1000|      35G71|  -20000|
|9161530435G71|  91615304|2222    |     2000|      35G71|   -2883|
|9161530235G71|  91615302|3333    |     3000|      35G71|    2000|
|9211530135G71|  92115301|4444    |     4000|      35G71|    8003|
+-------------+----------+--------+---------+-----------+--------+

Dataframe 2 (DF2):
+-------------+----------+--------+---------+-----------+--------+
| ACC_SECURITY|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+-------------+----------+--------+---------+-----------+--------+
|3FA34789290X2|  3FA34789|5555    |     5000|      290X2|  -20000|
|32934789290X2|  32934789|6666    |     6000|      290X2|   -2883|
|00000019290X2|  00000019|7777    |     7000|      290X2|    2000|
|3S534789290X2|  3S534789|8888    |     8000|      290X2|    8003|
+-------------+----------+--------+---------+-----------+--------+
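For reference, the two dataframes above can be recreated with something like the following (a minimal sketch, not part of the original post; a local SparkSession named spark and string-typed columns are assumed so that account numbers such as 00000019 keep their leading zeros):

// Assumed setup for reproducing the sample data above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("per-security-example")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Column names mirror the tables above; types are assumptions.
val df1 = Seq(
  ("9161530335G71", "91615303", "1111", "1000", "35G71", -20000),
  ("9161530435G71", "91615304", "2222", "2000", "35G71", -2883),
  ("9161530235G71", "91615302", "3333", "3000", "35G71", 2000),
  ("9211530135G71", "92115301", "4444", "4000", "35G71", 8003)
).toDF("ACC_SECURITY", "ACCOUNT_NO", "LONG_IND", "SHORT_IND", "SECURITY_ID", "QUANTITY")

val df2 = Seq(
  ("3FA34789290X2", "3FA34789", "5555", "5000", "290X2", -20000),
  ("32934789290X2", "32934789", "6666", "6000", "290X2", -2883),
  ("00000019290X2", "00000019", "7777", "7000", "290X2", 2000),
  ("3S534789290X2", "3S534789", "8888", "8000", "290X2", 8003)
).toDF("ACC_SECURITY", "ACCOUNT_NO", "LONG_IND", "SHORT_IND", "SECURITY_ID", "QUANTITY")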

What I tried: calling foreachPartition on each dataframe from bySecurityArray, and then using foreach to process each row of the resulting dataset (inside foreachPartition).

However, only the first dataframe (SECURITY_ID = 35G71) gets processed; the second one (290X2) never executes.

Error received with the code below:

def methodA(d1: DataFrame): Unit = {
    // Split the input dataframe into one dataframe per SECURITY_ID
    val securityIds = d1.select("SECURITY_ID").distinct.collect.flatMap(_.toSeq)
    val bySecurityArray = securityIds.map(securityId => d1.where($"SECURITY_ID" <=> securityId))

    for (i <- 0 until bySecurityArray.length) {
        val allocProcessDF = bySecurityArray(i).toDF()
        print("Number of partitions: " + allocProcessDF.rdd.getNumPartitions)
        methodB(allocProcessDF)
    }
}

def methodB(df: DataFrame): Unit = {
    import org.apache.spark.api.java.function.ForeachPartitionFunction
    df.foreachPartition(ds => {

        // Tried both the while loop and the foreach below (one at a time); same result.
        // Option 1
        while (ds.hasNext) {
            allocProcess(ds.next())
        }

        // Option 2
        ds.foreach(row => allocProcess(row))

    })
}

1 answer:

Answer 0: (score: 0)

Spark does not preserve ordering: the data is distributed across partitions, and even the order of the partitions themselves is not guaranteed because multiple tasks may run. To get a logical ordering, call coalesce(1) followed by sort(cols: _*) on the DataFrame; this returns a new DataFrame/Dataset sorted by the specified columns, all in ascending order.

def methodA(d1: DataFrame): Unit = {
  val securityIds = d1.select("SECURITY_ID").distinct.collect.flatMap(_.toSeq)
  val bySecurityArray = securityIds.map(securityId => d1.where(d1("SECURITY_ID") === securityId))

  for (i <- 0 until bySecurityArray.length) {
    val allocOneDF = bySecurityArray(i).toDF()
    print("Number of partitions: " + allocOneDF.rdd.getNumPartitions)
    methodB(allocOneDF)
  }
}

def methodB(df: DataFrame): Unit = {
  // coalesce(1) brings all rows into a single partition, so the sort order
  // by LONG_IND and SHORT_IND is preserved while iterating over the rows.
  df.coalesce(1).sort("LONG_IND", "SHORT_IND").foreach(row => {
    println(row)
    //allocProcess(row)
  })
}