For example, this is my existing header:
DataPartition|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
I want to create a header like this:
DataPartition_1|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime_1|^|SourceTypeCode_1|^|DocumentId_1|^|Dcn_1|^|DocFormat_1|^|StatementDate_1|^|IsFilingDateTimeEstimated_1|^|ContainsPreliminaryData_1|^|CapitalChangeAdjustmentDate_1|^|CumulativeAdjustmentFactor_1|^|ContainsRestatement_1|^|FilingDateTimeUTCOffset_1|^|ThirdPartySourceCode_1|^|ThirdPartySourcePriority_1|^|SourceTypeId_1|^|ThirdPartySourceCodeId_1|^|FFAction_1
I want to append _1 to every header column except TimeStamp, Source.organizationId, and Source.sourceId.
I did this with withColumn, but that way I have to handle every column individually.
Is there a simple way to do this with foldLeft?
Answer 0 (score: 1)
First, define the list of columns to skip:
val columnsToAvoid = List("TimeStamp","Source.organizationId","Source.sourceId")
Then you can foldLeft over the DataFrame's list of columns (given by df.columns), renaming every column that is not in columnsToAvoid and returning the DataFrame unchanged otherwise.
df.columns.foldLeft(df)((acc, elem) =>
if (columnsToAvoid.contains(elem)) acc
else acc.withColumnRenamed(elem, elem+"_1"))
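To see the renaming rule in isolation, here is a minimal sketch in plain Scala that needs no Spark. The object name RenameRule, the helper rename, and its suffix parameter are hypothetical names introduced only for illustration. It folds over a few of the header names from the question and threads an accumulator through them, in the same way the fold above threads the DataFrame.

```scala
// Hypothetical helper, not part of the original answer:
// the renaming rule applied to plain strings instead of a DataFrame.
object RenameRule {
  val columnsToAvoid = List("TimeStamp", "Source.organizationId", "Source.sourceId")

  // Append the suffix unless the column is in the skip list.
  def rename(name: String, suffix: String = "_1"): String =
    if (columnsToAvoid.contains(name)) name else name + suffix

  def main(args: Array[String]): Unit = {
    // A subset of the header from the question.
    val header = Seq("DataPartition", "TimeStamp", "Source.organizationId",
                     "Source.sourceId", "FilingDateTime")

    // foldLeft starts from an empty Vector and appends one (possibly renamed)
    // name per step; the answer's fold starts from df and renames one column per step.
    val renamed = header.foldLeft(Vector.empty[String])((acc, elem) => acc :+ rename(elem))

    println(renamed.mkString("|^|"))
    // prints DataPartition_1|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime_1
  }
}
```

The three skipped names pass through untouched, which matches the desired header in the question.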
Here is a simple example:
Original DF:
+-----+------+-----------+
| word| value|  TimeStamp|
+-----+------+-----------+
|wordA|valueA|45435345435|
|wordB|valueB|  454244345|
|wordC|valueC|32425425435|
+-----+------+-----------+
Operation:
df.columns.foldLeft(df)((acc, elem) => if (columnsToAvoid.contains(elem)) acc else acc.withColumnRenamed(elem, elem+"_1")).show
Result:
+------+-------+-----------+
|word_1|value_1|  TimeStamp|
+------+-------+-----------+
| wordA| valueA|45435345435|
| wordB| valueB|  454244345|
| wordC| valueC|32425425435|
+------+-------+-----------+
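As a possible alternative to the fold, the sketch below computes all the new names first and applies them in a single toDF call. This is a different technique from the one in the answer, shown only for comparison. The object name RenameWithToDF, the helper newNames, the app name, and the local master setting are assumptions for illustration; the data is the example DataFrame above.

```scala
import org.apache.spark.sql.SparkSession

object RenameWithToDF {
  val columnsToAvoid = Set("TimeStamp", "Source.organizationId", "Source.sourceId")

  // Hypothetical helper: map each column name to its new name.
  def newNames(cols: Array[String]): Array[String] =
    cols.map(c => if (columnsToAvoid.contains(c)) c else c + "_1")

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("rename-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Rebuild the example DataFrame from the answer.
    val df = Seq(
      ("wordA", "valueA", 45435345435L),
      ("wordB", "valueB", 454244345L),
      ("wordC", "valueC", 32425425435L)
    ).toDF("word", "value", "TimeStamp")

    // toDF takes one name per existing column, in order, and renames them all at once.
    val renamed = df.toDF(newNames(df.columns): _*)

    renamed.show()
    spark.stop()
  }
}
```

With many columns this may keep the query plan smaller, since each withColumnRenamed call adds its own projection, but for a header of this size either approach should work.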