Question

I have two datasets. Dataset 1 is shown below:

LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^|SegmentChildDescription.languageId|^|SegmentChildLocalLanguageLabel.languageId|^|SegmentGroupDescription.languageId|^|SegmentMultipleFundbDescription|^|SegmentMultipleFundbDescription.languageId|^|IsCredit|^|FinancialConceptLocalId|^|FinancialConceptGlobalId|^|FinancialConceptCodeGlobalSecondaryId|^|FFAction|!|
Japan|^|1507101869432|^|4295876606|^|1|^|BAL|^|Cash And Deposits|^|null|^|null|^|ACAE|^|false|^|null|^|null|^|null|^|null|^|false|^|null|^|null|^|null|^|null|^|505126|^|505074|^|null|^|null|^|null|^|null|^|null|^|null|^|null|^|3018759|^|null|^|I|!|

This is how I load the data with automatic schema discovery:

val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)

Dataset 2:

4295867927|^|860|^|CUS|^|External Revenue|^||^||^|REXR|^|False|^||^||^||^||^|False|^|False|^|CUS_REXR|^||^||^|505074|^|505074|^|505074|^|505074|^|505074|^||^|505074|^|True|^||^|3015250|^||^|I|!|

I create a DataFrame from each of them and then join the two. Finally, I write the joined output to a CSV file.

Here is the code that writes the CSV file.

val dfMainOutputFinal = dfMainOutput.select($"DataPartition", $"StatementTypeCode",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))

val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "null", "")).show()

dfMainOutputFinal.write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")

Everything works except .option("nullValue", ""). I cannot replace null with an empty value.

In my output I still see null values.

I also tried this, but got the same result.

val newDf = df.na.fill("e",Seq("blank"))

Answer 1

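I suspect the DataFrame does not actually contain nulls, but rather strings made up of the letters "null". If that is the case, you can simply replace every instance of "null" with "". After that, you can use .option("nullValue", "") as before. To replace a string within a column, use regexp_replace(column, "string to replace", "string to replace with"). A small example: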

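import org.apache.spark.sql.functions.{col, regexp_replace} // toDF also needs: import spark.implicits._
val df = Seq("a", "null", "c", "b").toDF("col1")
// Replace the literal string "null" with an empty string
val df2 = df.withColumn("col1", regexp_replace(col("col1"), "null", ""))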

Here "null" is replaced with "" as intended, and the final DataFrame looks like this:

+----+
|col1|
+----+
|   a|
|    |
|   c|
|   b|
+----+
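
One caveat with the pattern above: regexp_replace matches substrings, so a value such as "nullable" would also be altered. The following is a minimal sketch, assuming a local SparkSession and a hypothetical column named col1, that contrasts the unanchored pattern with an anchored one that only blanks cells whose entire value is "null":

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session created purely for illustration
val spark = SparkSession.builder().master("local[1]").appName("null-string-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data; "nullable" merely contains the letters "null"
val df = Seq("a", "null", "nullable", "b").toDF("col1")

// Unanchored: removes "null" wherever it occurs, so "nullable" becomes "able"
val loose = df.withColumn("col1", regexp_replace(col("col1"), "null", ""))

// Anchored: only a cell whose whole value is "null" is replaced
val exact = df.withColumn("col1", regexp_replace(col("col1"), "^null$", ""))

val looseValues = loose.as[String].collect().toSeq
val exactValues = exact.as[String].collect().toSeq
```

In the question's pipeline, the anchored form would probably have to be applied to each source column before concat_ws, because in the concatenated string "null" sits between delimiters rather than making up the whole value.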

Answer 2

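When reading, option("nullValue", "whatever") checks each column value against "whatever" and treats any matching value as null in the DataFrame.

Just use this option while reading and you are good to go.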

 Dataset<Row> df = spark.read().format("csv")
              .option("nullValue", "NULL")      // this config does the trick
              .option("sep", ",")
              .schema(structType)
              .load(filePath);
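
For readers following the Scala code in the question, here is a rough sketch of the same idea. It assumes a local SparkSession, a comma-separated temporary file, and hypothetical column names (region, lineItemId, statementTypeCode), none of which come from the original post. It shows nullValue used in both directions: on read it names the text to parse as null, and on write it names the text emitted for null.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session created purely for illustration
val spark = SparkSession.builder().master("local[1]").appName("nullvalue-demo").getOrCreate()

// Hypothetical sample file: the second field of the first row is the literal text "null"
val input = Files.createTempFile("lineitems", ".csv")
Files.write(input, "Japan,null,BAL\nKorea,860,CUS\n".getBytes("UTF-8"))

// On read, nullValue names the string that should be parsed as a real null
val df = spark.read
  .option("nullValue", "null")
  .csv(input.toString)
  .toDF("region", "lineItemId", "statementTypeCode")

// Count the rows where the literal "null" became a real null
val nullCount = df.filter(df("lineItemId").isNull).count()

// On write, nullValue names the string emitted for real nulls (empty here)
val output = Files.createTempDirectory("lineitems-out").resolve("csv").toString
df.write.option("nullValue", "").csv(output)

// Read the written files back as plain text to inspect the raw lines
val writtenLines = spark.read.textFile(output).collect().toSet
```

Once the literal "null" strings have been parsed as real nulls at read time, the write-side option("nullValue", "") from the question has something to act on.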

Replacing null with empty in a Spark DataFrame is not working
