如何在Spark数据框中镜像模式

时间:2018-02-07 13:50:21

标签: scala apache-spark apache-spark-sql

我想创建一个包含另一个数据框架的数据框。

这是我想要创建镜像架构的代码。

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    import sqlContext.implicits._
    import org.apache.spark.{ SparkConf, SparkContext }
    import java.sql.{Date, Timestamp}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions.udf


val getFFActionParent =  udf { (FFAction: String) => 
    if (FFAction=="Insert") "I|!|"
    else if (FFAction=="Overwrite") "I|!|"
    else "D|!|" 
}

val getFFActionChild =  udf { (FFAction: String) => 
    if (FFAction=="Insert") "I|!|"
    else if (FFAction=="Overwrite") "O|!|"
    else "D|!|" 
}

val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")


val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val FinancialSourceDF=dfContentItem.select($"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
val FinancialSourceDFFinal=dfContentItem.select($"env:Data.sr:Source._organizationId".as("organizationId"), $"env:Data.sr:Source._sourceId".as("sourceId"), $"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")

架构将从数据框下方镜像

val rdd1 = sc.textFile("s3://trfsmallfffile/FinancialSource/INCR")
val header1 = rdd1.filter(_.contains("Source.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val MirrorSchemaschema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)

这里我尝试创建镜像架构但无法创建它..

val FinancialSourceSaveDF=FinancialSourceDF.select(FinancialSourceDF.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source.").replace("sr:", ""))): _*)

FinancialSourceSaveDF.select(MirrorSchemaschema1.fieldNames.map(col): _*).show(false)

注意:FinancialSourceSaveDF数据框中的列可能会发生变化,有时候所有列都会出现,有时会出现其中一些列。 对于不存在的列,将填充空值。

1 个答案:

答案 0 :(得分:2)

获得dfContentItem之后,您只需select struct 重命名Source所需的元素使用alias 的列如下所示。

val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val dfType=dfContentItem.select($"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
val temp = dfType.select(dfType.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source_").replace("sr:", ""))): _*)

你可以使用你喜欢的udf功能

一旦您知道schema 所需的所有

Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|

现在您的主要问题是,您不知道引用的数据框架构中新数据框中可能缺少多少列名称 < / strong>即可。为此,您可以找到列名称之间的差异,并使用foldLeft函数填充缺少的列,其中null值为

val diff = schema.fieldNames.diff(temp.schema.fieldNames)
val finaldf = diff.foldLeft(temp){(temp2df, colName) => temp2df.withColumn(colName, lit(null))}

您可以select获取镜像架构,如下所示

finaldf.select(schema.fieldNames.map(col): _*).show(false)

您应该输出

+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|Source_organizationId|Source_sourceId|FilingDateTime         |SourceTypeCode|DocumentId|Dcn |DocFormat|StatementDate          |IsFilingDateTimeEstimated|ContainsPreliminaryData|CapitalChangeAdjustmentDate|CumulativeAdjustmentFactor|ContainsRestatement|FilingDateTimeUTCOffset|ThirdPartySourceCode|ThirdPartySourcePriority|SourceTypeId|ThirdPartySourceCodeId|FFAction|!||
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|4295906830           |344            |20171111T17:00:00+00:00|10K           |null      |null|null     |20171030T00:00:00+00:00|false                    |false                  |20171030T00:00:00+00:00    |1.0                       |false              |300                    |SS                  |1                       |3011835     |1000716240            |Overwrite  |
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+

我希望答案很有帮助