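As far as I know, withColumn can only add one column at a time (please correct me if I'm wrong), and I'm worried about memory and performance because the production DFs can be very large. Basically, the idea is to take the union of the two column arrays (Array[String]), make the result distinct, and use foldLeft over that set to update the accumulated DFs. I'm looking for a programmatic way to make the columns of the two DFs match so that I can perform a union afterwards.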
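import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// lowerCaseDFColumns and masterDFColumns are the Array[String] column names of each DF
val (newLowerCaseDF, newMasterDF): (DataFrame, DataFrame) = lowerCaseDFColumns.union(masterDFColumns).distinct
  .foldLeft[(DataFrame, DataFrame)]((lowerCaseDF, masterDF))((acc: (DataFrame, DataFrame), value: String) =>
    if (!lowerCaseDFColumns.contains(value)) {
      // column is missing from the input DF: add it as a null column
      // (lit(null) rather than lit(None), which Spark can reject as an unsupported literal type)
      (acc._1.withColumn(value, lit(null)), acc._2)
    }
    else if (!masterDFColumns.contains(value)) {
      // column is missing from the master DF: add it as a null column
      (acc._1, acc._2.withColumn(value, lit(null)))
    }
    else {
      acc
    }
  )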
Answer 0 (score: 0)
I found out that you can select a hardcoded null column, so my new solution is:
val masterExprs = lowerCaseDFColumns.union(lowerCaseMasterDFColumns).distinct.map(field =>
  //if the field already exists in master schema, we add the name to our select statement
  if (lowerCaseMasterDFColumns.contains(field)) {
    col(field.toLowerCase)
  }
  //else, we hardcode a null column in for that name
  else {
    lit(null).alias(field.toLowerCase)
  }
)
val inputExprs = lowerCaseDFColumns.union(lowerCaseMasterDFColumns).distinct.map(field =>
  //if the field already exists in input schema, we add the name to our select statement
  if (lowerCaseDFColumns.contains(field)) {
    col(field.toLowerCase)
  }
  //else, we hardcode a null column in for that name
  else {
    lit(null).alias(field.toLowerCase)
  }
)
Then you can do a union like this:
masterDF.select(masterExprs: _*).union(lowerCaseDF.select(inputExprs: _*))
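For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the names alignedExprs and unionWithMissingColumns are hypothetical (they are not Spark APIs and not part of the solution above), and it assumes the column names are already normalised (for example lower-cased), contain no dots, and that columns sharing a name have compatible types.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical helper: builds one select expression per name in allColumns,
// using the real column when df has it and a null literal aliased to that
// name otherwise.
def alignedExprs(df: DataFrame, allColumns: Seq[String]): Seq[Column] = {
  val present = df.columns.toSet
  allColumns.map { name =>
    if (present.contains(name)) col(name) else lit(null).alias(name)
  }
}

// Hypothetical helper: unions two DataFrames whose column sets differ.
// Assumes names are already normalised and same-named columns have
// compatible types.
def unionWithMissingColumns(left: DataFrame, right: DataFrame): DataFrame = {
  // Both sides select from the same ordered list, because union matches
  // columns by position, not by name.
  val allColumns = (left.columns ++ right.columns).distinct.toSeq
  left.select(alignedExprs(left, allColumns): _*)
    .union(right.select(alignedExprs(right, allColumns): _*))
}
```

With the DFs from the question this would be called as unionWithMissingColumns(masterDF, lowerCaseDF). Both sides are built with a single select, so no chain of withColumn calls is needed. On Spark 3.1 and later, masterDF.unionByName(lowerCaseDF, allowMissingColumns = true) should give a similar result without a helper.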