How do I dynamically select struct columns in a Spark DataFrame?

Time: 2018-04-16 12:46:39

Tags: apache-spark dataframe apache-spark-sql databricks

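I am trying to infer the schema of a struct and build a list of its fields to use as the select column list of a DataFrame. Each field is wrapped in col, and any : in its name is replaced with _ in the alias. The struct fields (properties) are optional, so I want to build the select statement from the input data.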

Schema inference:

  val listOfProperties = explodeFeatures.schema
     .filter(c => c.name == "listOfFeatures")
     .flatMap(_.dataType.asInstanceOf[StructType].fields)
     .filter(y => y.name == "properties")
     .flatMap(_.dataType.asInstanceOf[StructType].fields)
     .map(_.name)
     .map(x => "col(\"listOfFeatures.properties."+x+"\").as(\"properties_"+x.replace(":","_")+"\")")

Result of the statement above (val listOfProperties):

col("type").as("type")
col("listOfFeatures.properties.a").as("properties_A"),
col("listOfFeatures.properties.b:P1").as("properties_b_P1"),
col("listOfFeatures.properties.C:ID").as("properties_C_ID"),
col("listOfFeatures.properties.D:l").as("properties_D_1")

Select statement:

explodeFeatures.select(listOfProperties .head , listOfProperties .tail : _*)

However, the statement above fails to resolve at run time. If I use the hard-coded version below instead, it succeeds.

explodeFeatures.select(
col("type").as("type"),
col("listOfFeatures.properties.a").as("properties_A"),
col("listOfFeatures.properties.b:P1").as("properties_b_P1"),
col("listOfFeatures.properties.C:ID").as("properties_C_ID"),
col("listOfFeatures.properties.D:l").as("properties_D_1"))

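I build the list for two reasons:

I need to access the fields of the struct, and I need to rename them because their names contain : characters.

Can anyone explain why the hard-coded statement works but the one using listOfProperties.head, listOfProperties.tail does not?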

Exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'col("type")' given input columns: [type, listOfFeatures];

1 Answer:

Answer 0: (score: 1)

As suggested in the comments, your variable is a Seq[String]. When it is passed to select, the call amounts to df.select("col(name)"), so Spark looks for a column literally named col(name). You need to change the last map so that it builds Column objects instead of strings, as follows:

  val listOfProperties = explodeFeatures.schema
     .filter(c => c.name == "listOfFeatures")
     .flatMap(_.dataType.asInstanceOf[StructType].fields)
     .filter(y => y.name == "properties")
     .flatMap(_.dataType.asInstanceOf[StructType].fields)
     .map(_.name)
     .map(x => col(s"listOfFeatures.properties.${x}").as(s"""properties_${x.replace(":","_")}"""))

Because listOfProperties is now a Seq[Column], pass it to select as explodeFeatures.select(listOfProperties: _*).

Side note: use string interpolation. It's much cleaner!
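For reference, here is a minimal end-to-end sketch of the approach described in the answer. It is an illustration under assumptions, not the asker's actual job: the object name SelectStructFields, the helper propertyColumns, the local SparkSession settings, and the one-record JSON sample are all hypothetical. Only the column names type, listOfFeatures, and properties come from the question.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

object SelectStructFields {

  // Builds one aliased Column per field of listOfFeatures.properties.
  // Returning Column values (not strings) is what lets select resolve them.
  def propertyColumns(schema: StructType): Seq[Column] =
    schema
      .filter(_.name == "listOfFeatures")
      .flatMap(_.dataType.asInstanceOf[StructType].fields)
      .filter(_.name == "properties")
      .flatMap(_.dataType.asInstanceOf[StructType].fields)
      .map(_.name)
      .map { name =>
        // ':' is replaced in the alias only; the source path keeps the original name.
        val alias = "properties_" + name.replace(":", "_")
        col(s"listOfFeatures.properties.$name").as(alias)
      }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only for this illustration.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-struct-fields")
      .getOrCreate()
    import spark.implicits._

    // Assumption: a made-up record shaped like the DataFrame in the question.
    val explodeFeatures = spark.read.json(Seq(
      """{"type":"Feature","listOfFeatures":{"properties":{"a":"1","b:P1":"2"}}}"""
    ).toDS())

    // select(cols: Column*) takes the whole sequence via ': _*'.
    val columns = col("type") +: propertyColumns(explodeFeatures.schema)
    explodeFeatures.select(columns: _*).printSchema()

    spark.stop()
  }
}
```

Keeping the schema walk in a function that takes a StructType means the column list can be checked without starting a Spark job. The asInstanceOf casts still assume that listOfFeatures and properties are structs, as in the question; if either could be an array or absent, a pattern match on the data type would be a safer choice.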