How to get a column from an org.apache.spark.sql.Row by name?

Time: 2018-08-01 17:34:31

Tags: scala apache-spark apache-spark-sql spark-streaming

I have read records from a Kafka source into a Spark DataFrame, mydataframe. I want to select some columns from it and perform some operations. To check whether I am getting the correct index, I tried to print the field index inside the process method, like this:

val pathtoDesiredColumnFromSchema = "data.root.column1.column2.field"
val myQuery = mydataframe.writeStream.foreach(new ForeachWriter[Row]() {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(row: Row): Unit = {
    println(row.getFieldIndex(pathtoDesiredColumnFromSchema))
  }
  override def close(errorOrNull: Throwable): Unit = {}
}).outputMode("append").start()

But the code above fails: getFieldIndex is not a member of Row, so I cannot reach the column by its name path.

What is the correct way to get a column value from a Spark SQL Row by its name path?
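For reference, Row does expose fieldIndex(name) and getAs[T](name), but only when the row carries a schema. A minimal local sketch of that lookup, assuming spark-sql is on the classpath and using GenericRowWithSchema (a Spark-internal class, used here only to build a schema-bearing Row outside a running query):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

// Build a Row that carries its schema, as rows delivered to a
// ForeachWriter do.
val schema = StructType(Seq(
  StructField("key", StringType),
  StructField("value", LongType)))
val row: Row = new GenericRowWithSchema(Array[Any]("a", 42L), schema)

println(row.fieldIndex("value"))  // position of the column named "value"
println(row.getAs[Long]("value")) // the value itself, looked up by name
```

Note that fieldIndex takes a single field name, not a dotted path: nested fields must be reached struct by struct, as the answers below show.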

2 answers:

Answer 0 (score: 2)

You can use a chain of getAs calls for struct types, for example:

val df = spark.range(1,5).toDF.withColumn("time", current_timestamp())
.union(spark.range(5,10).toDF.withColumn("time", current_timestamp()))
.groupBy(window($"time", "1 millisecond")).count


df.printSchema
root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- count: long (nullable = false)

df.take(1).head
          .getAs[org.apache.spark.sql.Row]("window")
          .getAs[java.sql.Timestamp]("start")
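The chain above can be generalized to a dotted path with a small helper. This is a sketch (getByPath is a hypothetical name, not a Spark API); it assumes every intermediate field is a struct and that the rows carry a schema:

```scala
import org.apache.spark.sql.Row

// Hypothetical helper: walk a dotted path such as "window.start" through
// nested struct Rows, failing if an intermediate value is not a struct.
def getByPath(row: Row, path: String): Any =
  path.split('.').foldLeft[Any](row) {
    case (r: Row, field) => r.getAs[Any](field)
    case (other, field)  => sys.error(s"'$field' reached non-struct value: $other")
  }
```

With the DataFrame above, getByPath(df.take(1).head, "window.start") returns the same timestamp as the explicit getAs chain.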

Hope that helps!

Answer 1 (score: 0)

If you only want to print the fields of the DataFrame, you can use:

mydataframe.select(pathtoDesiredColumnFromSchema).foreach(row => println(row.get(0)))
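A self-contained sketch of this approach, with a hypothetical nested DataFrame standing in for the Kafka records and assuming a local SparkSession: a dotted path passed to select resolves the nested struct field and surfaces it as a top-level column, which each row can then return by position.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("nested-select").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the Kafka records: "data" becomes a
// struct with fields _1 (string) and _2 (bigint).
val df = Seq(("k1", ("inner", 42L)), ("k2", ("other", 7L))).toDF("key", "data")

// A dotted path in select() resolves nested struct fields.
val values = df.select("data._2").collect().map(_.getLong(0))
values.foreach(println)

spark.stop()
```

Note that calling foreach directly on a *streaming* DataFrame is not allowed; for a streaming source the printing has to go through writeStream (e.g. the foreach sink from the question, or the console sink).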