This is the output of my dataframe:

finaldf.show(false)
+------------------+-------------------------+---------------------+---------------+-------------------------+--------------+----------+----------+---------+-------------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|DataPartition |TimeStamp |Source_organizationId|Source_sourceId|FilingDateTime |SourceTypeCode|DocumentId|Dcn |DocFormat|StatementDate |IsFilingDateTimeEstimated|ContainsPreliminaryData|CapitalChangeAdjustmentDate|CumulativeAdjustmentFactor|ContainsRestatement|FilingDateTimeUTCOffset|ThirdPartySourceCode|ThirdPartySourcePriority|SourceTypeId|ThirdPartySourceCodeId|FFAction|!||
+------------------+-------------------------+---------------------+---------------+-------------------------+--------------+----------+----------+---------+-------------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|4298009288 |80 |2017-09-28T23:00:00+00:00|10K |null |171105584 |ASFILED |2017-07-31T00:00:00+00:00|false |false |2017-07-31T00:00:00+00:00 |1.0 |false |-300 |SS |1 |3011835 |1000716240 |I|!| |
|SelfSourcedPublic |2017-11-21T12:09:23+00:00|4295904170 |364 |2017-08-08T17:00:00+00:00|10Q |null |null |null |2017-07-30T00:00:00+00:00|false |false |2017-07-30T00:00:00+00:00 |1.0 |false |-300 |SS |1 |3011836 |1000716240 |I|!| |
|SelfSourcedPublic |2017-11-21T12:09:23+00:00|4295904170 |365 |2017-10-10T17:00:00+00:00|10K |null |null |null |2017-09-30T00:00:00+00:00|false |false |2017-09-30T00:00:00+00:00 |1.0 |false |-300 |SS |1 |3011835 |1000716240 |I|!| |
|SelfSourcedPublic |2017-11-21T12:17:49+00:00|4295904170 |365 |2017-10-10T17:00:00+00:00|10K |null |null |null |2017-09-30T00:00:00+00:00|false |false |2017-09-30T00:00:00+00:00 |1.0 |false |-300 |SS |1 |3011835 |1000716240 |I|!| |
When concat_ws is applied, the null columns are removed from the row.

Here is my code:
val finaldf = diff.foldLeft(tempReorder) { (temp2df, colName) =>
  temp2df.withColumn(colName, lit("null"))
}
//finaldf.show(false)

val headerColumn = data.columns.toSeq
val header = headerColumn.mkString("", "|^|", "|!|").dropRight(3)

val finaldfWithDelimiter = finaldf
  .select(concat_ws("|^|", finaldf.schema.fieldNames.map(col): _*).as("concatenated"))
  .withColumnRenamed("concatenated", header)

finaldfWithDelimiter.show(false)
I get the following output:
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DataPartition|^|TimeStamp|^|Source_organizationId|^|Source_sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!||
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|SelfSourcedPrivate|^|2017-11-02T10:23:59+00:00|^|4298009288|^|80|^|2017-09-28T23:00:00+00:00|^|10K|^|171105584|^|ASFILED|^|2017-07-31T00:00:00+00:00|^|false|^|false|^|2017-07-31T00:00:00+00:00|^|1.0|^|false|^|-300|^|SS|^|1|^|3011835|^|1000716240|^|I|!| |
|SelfSourcedPublic|^|2017-11-21T12:09:23+00:00|^|4295904170|^|364|^|2017-08-08T17:00:00+00:00|^|10Q|^|2017-07-30T00:00:00+00:00|^|false|^|false|^|2017-07-30T00:00:00+00:00|^|1.0|^|false|^|-300|^|SS|^|1|^|3011836|^|1000716240|^|I|!| |
|SelfSourcedPublic|^|2017-11-21T12:09:23+00:00|^|4295904170|^|365|^|2017-10-10T17:00:00+00:00|^|10K|^|2017-09-30T00:00:00+00:00|^|false|^|false|^|2017-09-30T00:00:00+00:00|^|1.0|^|false|^|-300|^|SS|^|1|^|3011835|^|1000716240|^|I|!| |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
In the output, DocumentId, which was replaced with null, is missing from the concatenated row.

I can't figure out what I am missing.
Answer 0 (score: 3)
concat_ws removes null columns during the concatenation. If you want to keep a placeholder for each null in the concatenated result, one approach is to create a type-dependent Map of colName -> nullValue for na.fill(), and use it to transform the dataframe before the concatenation, like the following:
import org.apache.spark.sql.functions.{col, concat_ws}
// assumes spark-shell, or `import spark.implicits._` for toDF

val df = Seq(
  (new Integer(1), "a"),
  (new Integer(2), null),
  (null, "c")
).toDF("col1", "col2")

df.withColumn("concat", concat_ws("|", df.columns.map(col): _*)).show
// +----+----+------+
// |col1|col2|concat|
// +----+----+------+
// | 1| a| 1|a|
// | 2|null| 2|
// |null| c| c|
// +----+----+------+
val naMap = df.dtypes.map( t => t._2 match {
case "StringType" => (t._1, "(n/a)")
case "IntegerType" => (t._1, 0)
case "LongType" => (t._1, 0L)
// cases for other types ...
case _ => (t._1, "(unknown)")
} ).toMap
// naMap: scala.collection.immutable.Map[String,Any] =
// Map(col1 -> 0, col2 -> (n/a))
df.na.fill(naMap).
withColumn("concat", concat_ws("|", df.columns.map(col): _*)).
show
// +----+-----+-------+
// |col1| col2| concat|
// +----+-----+-------+
// | 1| a| 1|a|
// | 2|(n/a)|2|(n/a)|
// | 0| c| 0|c|
// +----+-----+-------+
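Applied to the question's finaldf, a minimal sketch (assuming the null-holding columns such as DocumentId are string-typed; columns of other types still need entries in the Map above) could be:

// Sketch: na.fill("null") replaces nulls in every string column with the
// literal "null", so concat_ws keeps a placeholder for them.
val filledDf = finaldf.na.fill("null")
filledDf
  .select(concat_ws("|^|", filledDf.schema.fieldNames.map(col): _*).as("concatenated"))
  .show(false)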
Answer 1 (score: 2)
Since concat_ws ignores columns containing null, you have to handle them. One solution is to create a type-dependent Map of colName -> nullValue for na.fill() as suggested above, but then you have to specify all the cases. Since what you want to get is a String, another approach is to use the format_string function:
// Proof of concept in Scala (I don't have the compiler to test it).
df
  .withColumn(
    "concat",
    format_string(
      (for (c <- df.columns) yield "%s").mkString("|"),
      df.columns.map(col): _*
    )
  )
/*
Same solution tested in PySpark.
format_string(
'|'.join(['%s' for c in df.columns]),
*df.columns
)
*/
This way you avoid the Map definition, and an empty string will be placed for any null value in the dataframe columns.
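If you would rather not depend on how format_string renders nulls, a variation on the same idea (my sketch, not part of the original answer) is to cast each column to string and coalesce nulls to an empty string before concatenating:

import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}

// Sketch: coalesce every column to "" so concat_ws has no nulls left to drop.
val nullSafeCols = df.columns.map(c => coalesce(col(c).cast("string"), lit("")))
df.withColumn("concat", concat_ws("|", nullSafeCols: _*)).show()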
Answer 2 (score: 0)
You can also use a udf, for example:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{array, col, udf}

val concatUDF: UserDefinedFunction = udf((columns: Seq[String]) =>
  columns.map(c => if (c == null) "" else c).reduceLeft((a, b) => s"$a:$b"))
df.withColumn("concatenated", concatUDF(array(df.columns.map(col): _*)))
where array is org.apache.spark.sql.functions.array. This does not replace the original columns, and it returns an empty string for null values (or whatever you want to replace them with, via if (c == null) ""). You can also extend the UDF to support multiple types.
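Since the UDF's parameter type is Seq[String], columns of other types need a cast first. A usage sketch against the toy df from the first answer (an assumption for illustration, not part of the original answer):

import org.apache.spark.sql.functions.{array, col}

// Cast each column to string before building the array the UDF consumes,
// because the UDF expects Seq[String].
val stringCols = df.columns.map(c => col(c).cast("string"))
df.withColumn("concatenated", concatUDF(array(stringCols: _*))).show(false)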