I am trying to create a list from a struct type in a Spark DataFrame. The schema looks like this:
root
|
|-- plotList: array (nullable = true)
| |-- element: string (containsNull = true)
|-- plot: struct (nullable = true)
| |-- test: struct (nullable = true)
| | |-- body: string (nullable = true)
| | |-- colorPair: struct (nullable = true)
| | | |-- background: string (nullable = true)
| | | |-- foreground: string (nullable = true)
| | |-- eta: struct (nullable = true)
| | | |-- etaText: string (nullable = true)
| | | |-- etaType: string (nullable = true)
| | | |-- etaValue: string (nullable = true)
| | |-- headline: string (nullable = true)
| | |-- plotType: string (nullable = true)
| | |-- priority: long (nullable = true)
| | |-- plotCategory: string (nullable = true)
| | |-- productType: string (nullable = true)
| | |-- theme: string (nullable = true)
| |-- temp: struct (nullable = true)
| | |-- body: string (nullable = true)
| | |-- colorPair: struct (nullable = true)
| | | |-- background: string (nullable = true)
| | | |-- foreground: string (nullable = true)
| | |-- eta: struct (nullable = true)
| | | |-- etaText: string (nullable = true)
| | | |-- etaType: string (nullable = true)
| | | |-- etaValue: string (nullable = true)
| | |-- headline: string (nullable = true)
| | |-- logo: string (nullable = true)
| | |-- plotType: string (nullable = true)
| | |-- priority: long (nullable = true)
| | |-- plotCategory: string (nullable = true)
| | |-- productType: string (nullable = true)
| | |-- theme: string (nullable = true)
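I am trying to write a UDF that converts the plot column into a list of elements that I can explode in the next step. Something along the lines of plot -> [test, temp], where I can pick specific columns from test and temp. Any pointers in the right direction are much appreciated. I have tried several variations of UDFs, but none of them seems to work.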
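Edit:
I want to create a flattened structure from the sub-columns of the plot column. I am thinking of using case classes for this. Something like: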
case class ColorPair(back:String, fore:String)
case class Eta(EtaText: String, EtaType: String, EtaValue: String)
case class Plot(body: String, colorPair: ColorPair, eta: Eta, headline: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
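So essentially, after this, I expect something like List(Plot), which I can then explode in subsequent steps. Because explode does not work directly on struct types, I have to go through this transformation. In the Python world I could easily read this column as a dictionary, but nothing like that exists in Scala (as far as I know).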
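Answer 0 (score: 1)
If I understand correctly, you are looking for a way to traverse the schema and, whenever colorPair or eta is found, return the fields beneath it, that is, the leaves of: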
plot.test.colorPair
plot.test.eta
plot.temp.colorPair
plot.temp.eta
To generate the data (schema) for your case, I wrote the following code:
case class Eta(etaText: String, etaType: String, etaValue: String)
case class ColorPair(background: String, foreground: String)
case class Test(body: String, colorPair: ColorPair, eta: Eta, headline: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
case class Temp(body: String, colorPair: ColorPair, eta: Eta, headline: String, logo: String, plotType: String, priority: Long, plotCategory: String, productType: String, theme: String)
case class Plot(test: Test, temp: Temp)
case class Root(plotList: Array[String], plot: Plot)
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection

// Derive the Spark schema from the Root case class and print it as a tree.
def getSchema(): StructType = {
  val schema = ScalaReflection.schemaFor[Root].dataType.asInstanceOf[StructType]
  schema.printTreeString()
  schema
}
This outputs:
root
|-- plotList: array (nullable = true)
| |-- element: string (containsNull = true)
|-- plot: struct (nullable = true)
| |-- test: struct (nullable = true)
| | |-- body: string (nullable = true)
| | |-- colorPair: struct (nullable = true)
| | | |-- background: string (nullable = true)
| | | |-- foreground: string (nullable = true)
| | |-- eta: struct (nullable = true)
| | | |-- etaText: string (nullable = true)
| | | |-- etaType: string (nullable = true)
| | | |-- etaValue: string (nullable = true)
| | |-- headline: string (nullable = true)
| | |-- plotType: string (nullable = true)
| | |-- priority: long (nullable = false)
| | |-- plotCategory: string (nullable = true)
| | |-- productType: string (nullable = true)
| | |-- theme: string (nullable = true)
| |-- temp: struct (nullable = true)
| | |-- body: string (nullable = true)
| | |-- colorPair: struct (nullable = true)
| | | |-- background: string (nullable = true)
| | | |-- foreground: string (nullable = true)
| | |-- eta: struct (nullable = true)
| | | |-- etaText: string (nullable = true)
| | | |-- etaType: string (nullable = true)
| | | |-- etaValue: string (nullable = true)
| | |-- headline: string (nullable = true)
| | |-- logo: string (nullable = true)
| | |-- plotType: string (nullable = true)
| | |-- priority: long (nullable = false)
| | |-- plotCategory: string (nullable = true)
| | |-- productType: string (nullable = true)
| | |-- theme: string (nullable = true)
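As an aside, ScalaReflection lives in the catalyst package, which, as far as I know, is treated as internal to Spark. A minimal sketch of deriving a schema through the public Encoders API instead is shown below. The MiniEta and MiniPlot case classes and the EncoderSchemaSketch object are hypothetical names used only for this illustration; they are not part of the code above.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical, trimmed-down case classes used only for this illustration.
case class MiniEta(etaText: String, etaType: String)
case class MiniPlot(headline: String, priority: Long, eta: MiniEta)

object EncoderSchemaSketch {
  // Encoders.product builds an encoder for a case class;
  // the encoder exposes the derived schema as a StructType.
  val miniSchema: StructType = Encoders.product[MiniPlot].schema

  def main(args: Array[String]): Unit =
    miniSchema.printTreeString()
}
```

The same call with Root in place of MiniPlot should yield a schema that can be passed to the flattening step.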
Finally, the following code should flatten the desired fields:
import org.apache.spark.sql.types.StructType

// Returns the dotted path of every child of a struct field whose name is in targetFields.
def flattenSchema(schema: StructType, targetFields: List[String], prefix: String = null): Array[String] = {
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType =>
        val found = st.fields.filter(s => targetFields.contains(s.name))
        if (found.isEmpty) {
          // No target field at this level: keep descending.
          flattenSchema(st, targetFields, colName)
        } else {
          // Target fields found: emit one path per child of each target struct.
          found.flatMap { sf =>
            val target = sf.dataType.asInstanceOf[StructType]
            target.fields.map(child => s"$colName.${sf.name}.${child.name}")
          }
        }
      case _ => Array.empty[String]
    }
  }
}
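The code above scans the schema for fields whose names appear in the targetFields list and then uses flatMap to collect the paths of the fields nested inside them.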
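Because the function returns dotted column paths, the result can be handed to select. The following is a minimal sketch under that assumption; selectFlattened and the SelectFlattenedSketch object are hypothetical names, and the underscore alias is my own choice. Without an alias, a nested column keeps only its last name part, so the test and temp fields would end up with identical column names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object SelectFlattenedSketch {
  // `paths` is assumed to be the result of flattenSchema,
  // e.g. Seq("plot.test.eta.etaText", "plot.temp.eta.etaText").
  def selectFlattened(df: DataFrame, paths: Seq[String]): DataFrame = {
    // Alias each nested field with underscores so the output names stay unique.
    val columns = paths.map(p => col(p).as(p.replace('.', '_')))
    df.select(columns: _*)
  }
}
```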
For the schema above and targetFields = List("colorPair", "eta"), this should be the output:
plot.test.colorPair.background
plot.test.colorPair.foreground
plot.test.eta.etaText
plot.test.eta.etaType
plot.test.eta.etaValue
plot.temp.colorPair.background
plot.temp.colorPair.foreground
plot.temp.eta.etaText
plot.temp.eta.etaType
plot.temp.eta.etaValue
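If the goal is still one row per plot, as described in the question, a UDF may not be needed: explode works on arrays, so the sub-structs can be wrapped in an array first. The sketch below is only one possible approach, not something the code above does. The names explodePlots, plotName and plotEntry are hypothetical, and it assumes that every field listed in fields exists with the same type under each sub-struct (logo, for instance, exists only under temp and would have to be left out).

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object ExplodePlotsSketch {
  // plotNames: sub-structs of `plot` to turn into rows, e.g. Seq("test", "temp").
  // fields: fields present (with identical types) in every one of those sub-structs.
  def explodePlots(df: DataFrame, plotNames: Seq[String], fields: Seq[String]): DataFrame = {
    val entries: Seq[Column] = plotNames.map { name =>
      val picked = fields.map(f => col(s"plot.$name.$f").as(f))
      // Keep the originating sub-struct name next to the picked fields.
      struct(lit(name).as("plotName") +: picked: _*)
    }
    // array(...) needs all structs to share one type; explode then yields one row per entry.
    df.withColumn("plotEntry", explode(array(entries: _*)))
  }
}
```

A call such as ExplodePlotsSketch.explodePlots(df, Seq("test", "temp"), Seq("headline", "eta")) would then give two rows per input row, with the picked fields available as plotEntry.headline and plotEntry.eta.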