我想创建一个包含另一个数据框架的数据框。
这是我想要创建镜像架构的代码。
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
val getFFActionParent = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "I|!|"
else "D|!|"
}
val getFFActionChild = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "O|!|"
else "D|!|"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val FinancialSourceDF=dfContentItem.select($"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
val FinancialSourceDFFinal=dfContentItem.select($"env:Data.sr:Source._organizationId".as("organizationId"), $"env:Data.sr:Source._sourceId".as("sourceId"), $"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
架构将从数据框下方镜像
val rdd1 = sc.textFile("s3://trfsmallfffile/FinancialSource/INCR")
val header1 = rdd1.filter(_.contains("Source.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val MirrorSchemaschema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
这里我尝试创建镜像架构但无法创建它..
val FinancialSourceSaveDF=FinancialSourceDF.select(FinancialSourceDF.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source.").replace("sr:", ""))): _*)
FinancialSourceSaveDF.select(MirrorSchemaschema1.fieldNames.map(col): _*).show(false)
注意:FinancialSourceSaveDF
数据框中的列可能会发生变化,有时候所有列都会出现,有时会出现其中一些列。
对于不存在的列,将填充空值。
答案 0 :(得分:2)
获得dfContentItem
之后,您只需select
struct 并重命名Source
所需的元素使用alias
的列如下所示。
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val dfType=dfContentItem.select($"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
val temp = dfType.select(dfType.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source_").replace("sr:", ""))): _*)
你可以使用你喜欢的udf功能
一旦您知道schema
列所需的所有列
Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
现在您的主要问题是,您不知道引用的数据框架构中新数据框中可能缺少多少列名称 < / strong>即可。为此,您可以找到列名称之间的差异,并使用foldLeft
函数填充缺少的列,其中null
值为
val diff = schema.fieldNames.diff(temp.schema.fieldNames)
val finaldf = diff.foldLeft(temp){(temp2df, colName) => temp2df.withColumn(colName, lit(null))}
您可以select
获取镜像架构,如下所示
finaldf.select(schema.fieldNames.map(col): _*).show(false)
您应该输出
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|Source_organizationId|Source_sourceId|FilingDateTime |SourceTypeCode|DocumentId|Dcn |DocFormat|StatementDate |IsFilingDateTimeEstimated|ContainsPreliminaryData|CapitalChangeAdjustmentDate|CumulativeAdjustmentFactor|ContainsRestatement|FilingDateTimeUTCOffset|ThirdPartySourceCode|ThirdPartySourcePriority|SourceTypeId|ThirdPartySourceCodeId|FFAction|!||
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|4295906830 |344 |20171111T17:00:00+00:00|10K |null |null|null |20171030T00:00:00+00:00|false |false |20171030T00:00:00+00:00 |1.0 |false |300 |SS |1 |3011835 |1000716240 |Overwrite |
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
我希望答案很有帮助