I have the following 2 dataframes stored in an array. The data is already partitioned by SECURITY_ID.
How do I process each dataframe separately, and within each dataframe, process one row at a time? I tried the following.
Dataframe 1 (DF1):
+-------------+----------+--------+---------+-----------+--------+
| ACC_SECURITY|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+-------------+----------+--------+---------+-----------+--------+
|9161530335G71|  91615303|    1111|     1000|      35G71|  -20000|
|9161530435G71|  91615304|    2222|     2000|      35G71|   -2883|
|9161530235G71|  91615302|    3333|     3000|      35G71|    2000|
|9211530135G71|  92115301|    4444|     4000|      35G71|    8003|
+-------------+----------+--------+---------+-----------+--------+
Dataframe 2 (DF2):
+-------------+----------+--------+---------+-----------+--------+
| ACC_SECURITY|ACCOUNT_NO|LONG_IND|SHORT_IND|SECURITY_ID|QUANTITY|
+-------------+----------+--------+---------+-----------+--------+
|3FA34789290X2|  3FA34789|    5555|     5000|      290X2|  -20000|
|32934789290X2|  32934789|    6666|     6000|      290X2|   -2883|
|00000019290X2|  00000019|    7777|     7000|      290X2|    2000|
|3S534789290X2|  3S534789|    8888|     8000|      290X2|    8003|
+-------------+----------+--------+---------+-----------+--------+
I tried to process the data by calling foreachPartition on each dataframe from bySecurityArray, and then using foreach to process each row in the resulting partition iterator. But I only see the first dataframe (SECURITY_ID = 35G71) being processed, not the second one (290X2).
def methodA(d1: DataFrame): Unit = {
  val securityIds = d1.select("SECURITY_ID").distinct.collect.flatMap(_.toSeq)
  val bySecurityArray = securityIds.map(securityId => d1.where($"SECURITY_ID" <=> securityId))
  for (i <- 0 until bySecurityArray.length) {
    val allocProcessDF = bySecurityArray(i).toDF()
    print("Number of partitions: " + allocProcessDF.rdd.getNumPartitions)
    methodB(allocProcessDF)
  }
}
def methodB(df: DataFrame): Unit = {
  df.foreachPartition(rows => {
    // Tried the while loop below and also foreach... same result.
    // Option 1
    while (rows.hasNext) {
      allocProcess(rows.next())
    }
    // Option 2 (note: the iterator is consumed by Option 1, so only one of the two can run)
    rows.foreach(row => allocProcess(row))
  })
}
Answer 0 (score: 0)
Spark does not preserve row order, because the data is distributed across partitions, and even the partition order is not guaranteed since multiple tasks may run concurrently. To get a logical ordering, call coalesce(1) followed by sort(cols: String*) on the DataFrame; this returns a new DataFrame/Dataset sorted by the specified columns, all in ascending order.
def methodA(d1: DataFrame): Unit = {
val securityIds = d1.select("SECURITY_ID").distinct.collect.flatMap(_.toSeq)
val bySecurityArray = securityIds.map(securityId => d1.where(d1("SECURITY_ID") === securityId))
for (i <- 0 until bySecurityArray.length) {
val allocOneDF = bySecurityArray(i).toDF()
print("Number of partitions: " + allocOneDF.rdd.getNumPartitions)
methodB(allocOneDF)
}
}
def methodB(df: DataFrame): Unit = {
df.coalesce(1).sort("LONG_IND", "SHORT_IND").foreach(row => {
println(row)
//allocProcess(row)
})
}