Flattening a struct type in Scala

Time: 2019-03-30 21:07:45

Tags: scala apache-spark dataframe user-defined-functions

I am trying to create a list from a struct type in a Spark DataFrame. The schema looks like this:

root
|-- plotList: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- plot: struct (nullable = true)
|    |-- test: struct (nullable = true)
|    |    |-- body: string (nullable = true)
|    |    |-- colorPair: struct (nullable = true)
|    |    |    |-- background: string (nullable = true)
|    |    |    |-- foreground: string (nullable = true)
|    |    |-- eta: struct (nullable = true)
|    |    |    |-- etaText: string (nullable = true)
|    |    |    |-- etaType: string (nullable = true)
|    |    |    |-- etaValue: string (nullable = true)
|    |    |-- headline: string (nullable = true)
|    |    |-- plotType: string (nullable = true)
|    |    |-- priority: long (nullable = true)
|    |    |-- plotCategory: string (nullable = true)
|    |    |-- productType: string (nullable = true)
|    |    |-- theme: string (nullable = true)
|    |-- temp: struct (nullable = true)
|    |    |-- body: string (nullable = true)
|    |    |-- colorPair: struct (nullable = true)
|    |    |    |-- background: string (nullable = true)
|    |    |    |-- foreground: string (nullable = true)
|    |    |-- eta: struct (nullable = true)
|    |    |    |-- etaText: string (nullable = true)
|    |    |    |-- etaType: string (nullable = true)
|    |    |    |-- etaValue: string (nullable = true)
|    |    |-- headline: string (nullable = true)
|    |    |-- logo: string (nullable = true)
|    |    |-- plotType: string (nullable = true)
|    |    |-- priority: long (nullable = true)
|    |    |-- plotCategory: string (nullable = true)
|    |    |-- productType: string (nullable = true)
|    |    |-- theme: string (nullable = true)

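I am trying to write a UDF that converts the plot column into a list of elements that I can explode in the next step. Something along the lines of plot -> [test, temp], where I can pick some specific columns from test and temp. Any pointers in the right direction are much appreciated. I have tried several variants of UDFs, but none of them seem to work.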

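Edit:

I want to create a flattened structure from the sub-columns of the plot column. I am thinking of using case classes for this. Something like: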

case class ColorPair(back:String, fore:String)
case class Eta(EtaText: String, EtaType: String, EtaValue: String)
case class Plot(body: String, colorPair: ColorPair, eta: Eta, headline: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)

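So essentially, after this I am expecting something like List(Plot), which I can then explode in subsequent steps. Since explode does not work directly on struct types, I have to go through this transformation. In the Python world I could easily read this column as a dictionary, but as far as I know nothing comparable exists in Scala.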

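1 answer:

Answer 0: (score: 1)

If I understand correctly, you are looking for a way to traverse the schema and, whenever colorPair or eta is found, return the following fields: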

plot.test.colorPair
plot.test.eta
plot.temp.colorPair
plot.temp.eta

To generate the data (schema) for your case, I wrote the following code:

  case class Eta(etaText: String, etaType: String, etaValue: String)
  case class ColorPair(background: String, foreground: String)
  case class Test(body: String, colorPair: ColorPair, eta: Eta, headline: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
  case class Temp(body: String, colorPair: ColorPair, eta: Eta ,headline: String, logo: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
  case class Plot(test: Test, temp: Temp)
  case class Root(plotList: Array[String], plot: Plot)

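  // Imported at the top level so that StructType is visible in the method signature
  import org.apache.spark.sql.types.StructType
  import org.apache.spark.sql.catalyst.ScalaReflection

  // Derive the Spark schema from the Root case class, print it, and return it
  def getSchema(): StructType = {
    val schema = ScalaReflection.schemaFor[Root].dataType.asInstanceOf[StructType]

    schema.printTreeString()
    schema
  }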

This prints:

root
 |-- plotList: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- plot: struct (nullable = true)
 |    |-- test: struct (nullable = true)
 |    |    |-- body: string (nullable = true)
 |    |    |-- colorPair: struct (nullable = true)
 |    |    |    |-- background: string (nullable = true)
 |    |    |    |-- foreground: string (nullable = true)
 |    |    |-- eta: struct (nullable = true)
 |    |    |    |-- etaText: string (nullable = true)
 |    |    |    |-- etaType: string (nullable = true)
 |    |    |    |-- etaValue: string (nullable = true)
 |    |    |-- headline: string (nullable = true)
 |    |    |-- plotType: string (nullable = true)
 |    |    |-- priority: long (nullable = false)
 |    |    |-- plotCategory: string (nullable = true)
 |    |    |-- productType: string (nullable = true)
 |    |    |-- theme: string (nullable = true)
 |    |-- temp: struct (nullable = true)
 |    |    |-- body: string (nullable = true)
 |    |    |-- colorPair: struct (nullable = true)
 |    |    |    |-- background: string (nullable = true)
 |    |    |    |-- foreground: string (nullable = true)
 |    |    |-- eta: struct (nullable = true)
 |    |    |    |-- etaText: string (nullable = true)
 |    |    |    |-- etaType: string (nullable = true)
 |    |    |    |-- etaValue: string (nullable = true)
 |    |    |-- headline: string (nullable = true)
 |    |    |-- logo: string (nullable = true)
 |    |    |-- plotType: string (nullable = true)
 |    |    |-- priority: long (nullable = false)
 |    |    |-- plotCategory: string (nullable = true)
 |    |    |-- productType: string (nullable = true)
 |    |    |-- theme: string (nullable = true)
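
As an aside, the same kind of schema can also be derived through the public Encoders API instead of the internal catalyst ScalaReflection helper. The following is only a minimal sketch: the case classes are trimmed-down, hypothetical stand-ins for the ones above, and EncoderSchemaSketch and rootSchema are names made up for illustration.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Trimmed-down, hypothetical stand-ins for the case classes defined above
case class Eta(etaText: String, etaType: String, etaValue: String)
case class ColorPair(background: String, foreground: String)
case class Test(body: String, colorPair: ColorPair, eta: Eta, priority: Long)
case class Plot(test: Test)
case class Root(plotList: Array[String], plot: Plot)

object EncoderSchemaSketch {
  // Encoders.product builds an encoder for a case class; its schema is a StructType
  def rootSchema: StructType = Encoders.product[Root].schema

  def main(args: Array[String]): Unit =
    rootSchema.printTreeString()
}
```

This does not need a running SparkSession, since only the schema is derived and no data is processed.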

Finally, the following code should flatten the required fields:

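import org.apache.spark.sql.types._

// Walks the schema and, for every struct whose children include one of targetFields,
// returns the fully qualified names of the fields nested under those children.
def flattenSchema(schema: StructType, targetFields: List[String], prefix: String = null): Array[String] =
  {
    schema.fields.flatMap(f => {
      val colName = if (prefix == null) f.name else (prefix + "." + f.name)

      f.dataType match {
        case st: StructType =>
          val found = st.filter(s => targetFields.contains(s.name))

          if (found.isEmpty) {
            // No target field at this level: recurse one level deeper
            flattenSchema(st, targetFields, colName)
          }
          else
            // Target fields found: emit one dotted path per nested field.
            // Note: this assumes every target field is itself a struct.
            found.flatMap(sf => {
              val inner = sf.dataType.asInstanceOf[StructType]
              inner.map(child => s"${colName}.${sf.name}.${child.name}")
            }).toArray

        case _ => Array[String]()
      }
    })
  }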

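The code above scans the schema for fields that appear in the targetFields list and then uses flatMap to collect the names of the fields nested under them.

This should be the output: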

plot.test.colorPair.background
plot.test.colorPair.foreground
plot.test.eta.etaText
plot.test.eta.etaType
plot.test.eta.etaValue
plot.temp.colorPair.background
plot.temp.colorPair.foreground
plot.temp.eta.etaText
plot.temp.eta.etaType
plot.temp.eta.etaValue
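
One way these dotted paths might then be used is sketched below. This is not part of the original answer, only an illustration under stated assumptions: the case classes are trimmed-down stand-ins, and FlattenUsageSketch, selectFlattened and explodePlots are hypothetical names. The first helper selects the flattened paths with aliases, so that fields sharing a leaf name do not collide. The second builds one struct per branch with an identical field list, wraps the structs in an array, and explodes it, which is roughly the plot -> [test, temp] shape the question asks for.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, struct}

// Trimmed-down, hypothetical stand-ins for the case classes defined above
case class Eta(etaText: String, etaType: String, etaValue: String)
case class ColorPair(background: String, foreground: String)
case class Test(body: String, colorPair: ColorPair, eta: Eta)
case class Temp(body: String, colorPair: ColorPair, eta: Eta, logo: String)
case class Plot(test: Test, temp: Temp)
case class Root(plotList: Array[String], plot: Plot)

object FlattenUsageSketch {
  // Select each dotted path under an alias such as plot_test_eta_etaText,
  // so that test and temp fields with the same leaf name stay distinguishable.
  def selectFlattened(df: DataFrame, paths: Seq[String]): DataFrame =
    df.select(paths.map(p => col(p).as(p.replace(".", "_"))): _*)

  // Build one struct per branch from the same list of fields, put the structs
  // into an array, and explode it into one row per branch.
  // Assumes every listed field exists with the same type in every branch.
  def explodePlots(df: DataFrame, branches: Seq[String], fields: Seq[String]): DataFrame = {
    val structs = branches.map { b =>
      struct(fields.map(f => col(s"plot.$b.$f").as(f)): _*)
    }
    df.select(explode(array(structs: _*)).as("plot"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("flatten-sketch").getOrCreate()
    import spark.implicits._

    // Made-up sample values, for illustration only
    val eta = Eta("10 min", "relative", "600")
    val colors = ColorPair("#000000", "#ffffff")
    val df = Seq(
      Root(Array("test", "temp"), Plot(Test("a", colors, eta), Temp("b", colors, eta, "logo.png")))
    ).toDF()

    selectFlattened(df, Seq("plot.test.eta.etaText", "plot.temp.eta.etaText")).show()
    explodePlots(df, Seq("test", "temp"), Seq("body", "colorPair", "eta")).show(false)

    spark.stop()
  }
}
```

Fields that exist in only one branch (logo in temp, for example) have to be left out of the shared field list, because all elements of an array must have the same type.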