I have a large dataset that I transform from one structure to another. During that stage I would also like to collect some information about a computed field (the quadkey for a given longitude/latitude). I don't want to attach this information to every result row, since that would add a lot of duplicated data and memory overhead. All I need to know is which specific quadkeys are touched by the given coordinates. Is there a way to do this in a single job, without iterating over the dataset twice?
def load(paths: Seq[String]): (Dataset[ResultStruct], Dataset[String]) = {
  val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
    .schema(schema)
    .option("delimiter", "\t")
    .load(paths: _*)
    .as[InitialStruct]

  val qkSet = mutable.HashSet.empty[String]
  val result = df.map(c => {
    val id = c.id
    val points = toPoints(c.geom)
    points.foreach(p => qkSet.add(Quadkey.get(p.lat, p.lon, 6).getId))
    createResultStruct(id, points)
  })
  return result, //some dataset created from qkSet's from all executors
}
Answer (score: 1)
You can use an accumulator:
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.util.AccumulatorV2

// Set-valued accumulator: each task adds items to its local copy and Spark
// merges the partial sets back on the driver.
class SetAccumulator[T] extends AccumulatorV2[T, Set[T]] {
  import scala.collection.JavaConverters._

  private val items = new ConcurrentHashMap[T, Boolean]

  override def isZero: Boolean = items.isEmpty

  override def copy(): AccumulatorV2[T, Set[T]] = {
    val other = new SetAccumulator[T]
    other.items.putAll(items)
    other
  }

  override def reset(): Unit = items.clear()

  override def add(v: T): Unit = items.put(v, true)

  override def merge(other: AccumulatorV2[T, Set[T]]): Unit = other match {
    case setAccumulator: SetAccumulator[T] => items.putAll(setAccumulator.items)
  }

  override def value: Set[T] = items.keys().asScala.toSet
}
import org.apache.spark.sql.Row
import spark.implicits._ // for toDF and the String encoder used by map

val df = Seq("foo", "bar", "foo", "foo").toDF("test")
val acc = new SetAccumulator[String]
spark.sparkContext.register(acc)

df.map {
  case Row(str: String) =>
    acc.add(str)
    str
}.count()

println(acc.value)
prints
Set(bar, foo)
Note that map itself is lazy, so something like count is needed to actually force the computation. Depending on the actual use case, another option is to cache the data frame and simply use the plain SQL function df.select("test").distinct().
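Putting the accumulator back into the shape of the original load function, a rough sketch might look like the following. It assumes the InitialStruct, ResultStruct, schema, toPoints, Quadkey and createResultStruct definitions from the question, and it returns the quadkeys as a plain Set[String] rather than a Dataset[String]; because map is lazy, the set is only complete after the count below has run.

import org.apache.spark.sql.Dataset

def load(paths: Seq[String]): (Dataset[ResultStruct], Set[String]) = {
  val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", "\t")
    .schema(schema)
    .load(paths: _*)
    .as[InitialStruct]

  // One accumulator per load; each task records the quadkeys it touches.
  val qkAcc = new SetAccumulator[String]
  sparkSession.sparkContext.register(qkAcc, "quadkeys")

  val result = df.map { c =>
    val points = toPoints(c.geom)
    points.foreach(p => qkAcc.add(Quadkey.get(p.lat, p.lon, 6).getId))
    createResultStruct(c.id, points)
  }

  // Cache and run one action so the accumulator is populated before its
  // value is read; reusing the cached result avoids a second pass over
  // the input when the caller consumes the dataset.
  result.cache()
  result.count()

  (result, qkAcc.value)
}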