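I am new to Spark and Scala. I am currently working with Spark DataFrames. I need to iterate over the records and repeat the same value until the next condition is met. See the example below: the file I was given has a single column. It contains two kinds of values, header data and detail data. Header data is always 10 characters long and detail data is always 15 characters long. I want to prepend the 10-character header to each of the following 15-character detail records until the next 10-character header is reached, and so on...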
df
---------------
1RHGTY567U //header data
6786TYUIOPTR141 //detail data
6786TYUIOPTYU67 //detail data
T7997999HHBFFE6 //detail data
8YUITY567U //header data
HJS7890876997BB //detail data
BFJFBFKFN787897
GS678790877656H
BFJFDK786WQ4243
74849469GJGNVFM
67YUBMHJKH
VFJF788968FJFJD
HFJFGKJD789768D
GFJFHFFLLJFJDLD
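I have tried collecting the DataFrame, looping over it, and concatenating each record with the others, as shown below. This approach is expensive, since using collect() is not recommended. I could use the lag window function to concatenate the current value with the previous one, but my case is a little different.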
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.length
import spark.implicits._

val srcDF = spark.read.format("csv").load(location + "/" + filename)
// Add a column holding the length of the value in the first column
val newDF = srcDF.withColumn("col_length", length($"_c0"))
// Convert the DataFrame to an RDD of "value,length" strings
val rdd = newDF.map(row => row(0).toString + "," + row(1).toString).rdd
// Iterate over the collected rows, prefixing each detail record with the latest header
var rec = ""
val srcModified = ListBuffer[String]()
for (row <- rdd.collect) {
  val fields = row.split(",")
  if (fields(1).toInt == 10) { // header record: remember it and emit it unchanged
    rec = fields(0)
    srcModified += rec
  } else {                     // detail record: prefix it with the current header
    srcModified += rec + fields(0)
  }
}
// Convert the ListBuffer back to an RDD
val modifiedRDD = sc.parallelize(srcModified.toSeq)
The output I expect is shown below:
new_DF
------
1RHGTY567U //header data
1RHGTY567U6786TYUIOPTR141 //header data concatenated with detail data
1RHGTY567U6786TYUIOPTYU67 //header data concatenated with detail data
1RHGTY567UT7997999HHBFFE6 //header data concatenated with detail data
8YUITY567U //header data
8YUITY567UHJS7890876997BB //header data concatenated with detail data
8YUITY567UBFJFBFKFN787897 //header data concatenated with detail data
8YUITY567UGS678790877656H //header data concatenated with detail data
8YUITY567UBFJFDK786WQ4243 //header data concatenated with detail data
8YUITY567U74849469GJGNVFM //header data concatenated with detail data
67YUBMHJKH
67YUBMHJKHVFJF788968FJFJD
67YUBMHJKHHFJFGKJD789768D
67YUBMHJKHGFJFHFFLLJFJDLD
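Any suggestions?
Answer 0 (score: 1)
An incrementing id column can be added to the DataFrame, and a window ordered by that column can then find the most recent header using the "last" function: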
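import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Tag each row with an increasing id so a window can follow the row order
val withId = originalDF.select($"value", monotonically_increasing_id().alias("id"))
val idWindow = Window.orderBy("id")
val result = withId
  .withColumn("previousHeader",
    // Rows shorter than 15 characters are headers; take the last non-null one seen so far
    last(when(length($"value") < 15, $"value"), ignoreNulls = true).over(idWindow))
  .select(
    // Header rows stay as they are; detail rows get the header prepended
    when($"value" === $"previousHeader", $"value")
      .otherwise(concat($"previousHeader", $"value")).alias("value"))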