
Question

I am fairly new to Spark with Scala and am finding it hard to iterate over a DataFrame. My DataFrame has two columns, path and ingestiontime.

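I want to iterate over this DataFrame and use the values in the path and ingestiontime columns to build and run a Hive query for each row, so that the executed query looks like this: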

ALTER TABLE <hiveTableName> ADD PARTITION (ingestiontime=<Ingestiontime_From_the_DataFrame_ingestiontime_column>) LOCATION '<Path_From_the_dataFrames_path_column>'

To achieve this, I used:

allOtherIngestionTime.collect().foreach { row =>
  // Join the row into one string, then split it again to recover the two columns
  val fields = row.mkString("<SomeCustomDelimiter>").split("<SomeCustomDelimiter>")
  val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + fields(1) + ") LOCATION '" + fields(0) + "'"
  spark.sql(prepareHiveQuery)
}

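But I feel this is quite risky, for example when my data happens to contain the delimiter. I would very much like to know other ways of iterating over the rows and columns of a DataFrame.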

Answer 1

Try the following code.

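import org.apache.spark.sql.functions.{col, concat_ws, lit}
import spark.implicits._  // provides the Encoder needed by .as[String]

df
  .withColumn("query", concat_ws("",
    lit("ALTER TABLE myhiveTable ADD PARTITION (ingestiontime="), col("ingestiontime"),
    lit(") LOCATION '"), col("path"), lit("'")))
  .select("query")
  .as[String]
  .collect()  // bring the statements to the driver, where spark.sql can run
  .foreach(q => spark.sql(q))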
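Since the question worries about values that contain awkward characters, the statement text could also be built by a small plain-Scala helper that quotes each value. This is only a sketch: `PartitionDdl`, `quote` and `addPartition` are hypothetical names, and it assumes a string-typed partition column and Spark's default handling of backslash escapes inside string literals.

```scala
object PartitionDdl {
  // Wrap a value in single quotes, escaping backslashes and single quotes
  // so the value cannot terminate the SQL string literal early.
  def quote(value: String): String =
    "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

  // Build one ADD PARTITION statement; the table name is inserted as-is,
  // so it is assumed to come from trusted code rather than from the data.
  def addPartition(table: String, ingestionTime: String, path: String): String =
    s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION (ingestiontime=${quote(ingestionTime)}) LOCATION ${quote(path)}"
}
```

On the driver, each collected row could then be passed through `PartitionDdl.addPartition` before calling `spark.sql`, which avoids both the custom delimiter and hand-written quoting.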

Answer 2

To access the path and ingestiontime columns you can use row.getString(0) and row.getString(1).

DataFrames

val allOtherIngestionTime: DataFrame = ???
// spark.sql can only be called on the driver, so bring the rows back first
allOtherIngestionTime.collect().foreach { row =>
  val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.getString(1) + ") LOCATION '" + row.getString(0) + "'"
  spark.sql(prepareHiveQuery)
}

Datasets

If you use a Dataset instead of a DataFrame, you can access the fields more conveniently as row.path and row.ingestionTime.

case class myCaseClass(path: String, ingestionTime: String)

val ds: Dataset[myCaseClass] = ???

// spark.sql must run on the driver, so collect the rows first
ds.collect().foreach { row =>
  val prepareHiveQuery = "ALTER TABLE myhiveTable ADD PARTITION (ingestiontime = " + row.ingestionTime + ") LOCATION '" + row.path + "'"
  spark.sql(prepareHiveQuery)
}

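Either way, to iterate over a DataFrame or Dataset you can use foreach, and if you want to transform the contents into something else you can use map.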

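Also note that collect() brings all the data to the driver, which is not recommended for large datasets; foreach and map run on the executors and do not need collect(). Keep in mind, though, that spark.sql can only be called on the driver, so statements such as the ALTER TABLE in this question have to be issued from rows that have been brought back to the driver.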
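One way to combine both points is to build the statement strings with map on the executors and then stream them back to the driver with toLocalIterator(), which fetches one partition at a time instead of materializing everything as collect() does. The following is a minimal sketch: `PartitionRow` and `AddPartitions` are hypothetical names, and the values are assumed to contain no single quotes.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row type; the field names mirror the columns in the question.
case class PartitionRow(path: String, ingestiontime: String)

object AddPartitions {
  // Runs on the executors: turn each row into one statement string.
  def statements(rows: Dataset[PartitionRow])(implicit spark: SparkSession): Dataset[String] = {
    import spark.implicits._ // Encoder[String] for the result of map
    rows.map(r =>
      s"ALTER TABLE myhiveTable ADD IF NOT EXISTS PARTITION (ingestiontime='${r.ingestiontime}') LOCATION '${r.path}'")
  }

  // Runs on the driver: pull the statements back partition by partition
  // and execute each one with spark.sql.
  def run(rows: Dataset[PartitionRow])(implicit spark: SparkSession): Unit = {
    val it = statements(rows).toLocalIterator()
    while (it.hasNext) spark.sql(it.next())
  }
}
```

For a DataFrame with matching column names, `df.as[PartitionRow]` would give the typed Dataset expected here. When the DataFrame only lists a handful of partitions, a plain collect() is just as reasonable.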

If you want to iterate over the fields of a row, you can turn it into a Seq and iterate over that:

row.toSeq.foreach{column => ...}
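As a slightly fuller illustration of iterating over the fields, the sketch below pairs each value with its column name when the row carries a schema (as rows collected from a DataFrame do) and falls back to the field position otherwise. `RowFields` and `describe` are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.Row

object RowFields {
  // Render every field of a row as "name=value".
  // Rows built directly with Row(...) have no schema, so the index is used as the name.
  def describe(row: Row): Seq[String] = {
    val values = row.toSeq
    val names: Seq[String] =
      Option(row.schema).map(_.fieldNames.toSeq).getOrElse(values.indices.map(_.toString))
    names.zip(values).map { case (name, value) => s"$name=$value" }
  }
}
```

For example, `RowFields.describe(Row("/data/a", "20200101"))` labels the two values by position, because that row was created without a schema.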
