Merging multiple PySpark DataFrame rows to convert event-based data into person-based data

Date: 2017-05-22 14:23:29

Tags: python apache-spark pyspark apache-spark-sql

So, say I have an event-based DataFrame. Basically, every time something happens, I receive a new event saying that someone changed location or job. Here is an example of the input:

+--------+----+----------------+---------------+
|event_id|name|             job|       location|
+--------+----+----------------+---------------+
|      10| Bob|         Manager|               |
|       9| Joe|                |             HQ|
|       8| Tim|                |New York Office|
|       7| Joe|                |New York Office|
|       6| Joe| Head Programmer|               |
|       5| Bob|                |      LA Office|
|       4| Tim|         Manager|             HQ|
|       3| Bob|                |New York Office|
|       2| Bob|DB Administrator|             HQ|
|       1| Joe|      Programmer|             HQ|
+--------+----+----------------+---------------+

In this example, 10 is the newest event and 1 is the oldest. Now I want to get the latest information about each person. Here is the output I would like:

+----+---------------+---------------+
|name|            job|       location|
+----+---------------+---------------+
| Bob|        Manager|      LA Office|
| Joe|Head Programmer|             HQ|
| Tim|        Manager|New York Office|
+----+---------------+---------------+

My current way of doing this restructuring is to collect the data and then loop through the events from newest to oldest, looking up the information about each person. The problem with this approach is that it is extremely slow for large DataFrames, and eventually the data won't all fit into one machine's memory. What is the right way to do this with Spark?
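For concreteness, the pattern described above looks roughly like this (a hypothetical sketch; `df` stands in for the event DataFrame, and `collect()` is the step that pulls everything onto the driver):

# Hypothetical sketch of the slow, driver-side approach described above.
people = {}
# collect() pulls the entire DataFrame onto the driver, newest event first.
for row in df.orderBy(df.event_id.desc()).collect():
    person = people.setdefault(row["name"], {"job": "", "location": ""})
    for field in ("job", "location"):
        # Keep the first (i.e. most recent) non-empty value we see.
        if not person[field] and row[field]:
            person[field] = row[field]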

1 Answer:

Answer 0 (score: 1)

Based on your question, I think this is what you want:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()

import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(
    (10, "Bob", "Manager", ""),
    (9, "Joe", "", "HQ"),
    (8, "Tim", "", "New York Office"),
    (7, "Joe", "", "New York Office"),
    (6, "Joe", "Head Programmer", ""),
    (5, "Bob", "", "LA Office"),
    (4, "Tim", "Manager", "HQ"),
    (3, "Bob", "", "New York Office"),
    (2, "Bob", "DB Administrator", "HQ"),
    (1, "Joe", "Programmer", "HQ")
  )).toDF("event_id", "name", "job", "location")

// For each person, find the id of their most recent event.
val latest = data.groupBy("name").agg(max(data("event_id")).alias("event_id"))

// Join back to the events to recover that row's job and location.
// Joining on both columns avoids a duplicate "name" column in the result.
latest.join(data, Seq("name", "event_id")).drop("event_id").show()

This is Scala code; hopefully you can convert it to Python.
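A PySpark version follows. One caveat with the join-on-max approach above: it returns the newest row verbatim, so fields that happen to be empty in that row stay empty (for example, Bob's newest event has no location, so he would not get "LA Office"). The sketch below instead takes each person's most recent non-empty value of every column, using a window with `last(..., ignorenulls=True)`. It assumes, as the example data suggests, that an empty string means "unchanged":

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local").appName("test").getOrCreate()

data = spark.createDataFrame(
    [
        (10, "Bob", "Manager", ""),
        (9, "Joe", "", "HQ"),
        (8, "Tim", "", "New York Office"),
        (7, "Joe", "", "New York Office"),
        (6, "Joe", "Head Programmer", ""),
        (5, "Bob", "", "LA Office"),
        (4, "Tim", "Manager", "HQ"),
        (3, "Bob", "", "New York Office"),
        (2, "Bob", "DB Administrator", "HQ"),
        (1, "Joe", "Programmer", "HQ"),
    ],
    ["event_id", "name", "job", "location"],
)

fields = ["job", "location"]

# Assumption: an empty string means "unchanged", so map it to null
# before looking for the most recent value.
events = data.select(
    "event_id", "name",
    *[F.when(F.col(c) != "", F.col(c)).alias(c) for c in fields],
)

# Per person, ordered by event_id, take the last non-null value of each field.
w = (
    Window.partitionBy("name")
    .orderBy("event_id")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

latest = events.select(
    "name",
    *[F.last(c, ignorenulls=True).over(w).alias(c) for c in fields],
).distinct()

latest.show()

Everything here runs as Spark transformations, so nothing is collected to the driver, and `show()` should match the desired output above.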