I have the following table:
+-----+---+----+
|type | t |code|
+-----+---+----+
|    A| 25|  11|
|    A| 55|  42|
|    B| 88|  11|
|    A|114|  11|
|    B|220|  58|
|    B|520|  11|
+-----+---+----+
What I want:
+---+---+----+
|t1 | t2|code|
+---+---+----+
| 25| 88|  11|
|114|520|  11|
+---+---+----+
There are two types of events, A and B. Event A is a start and event B is an end. I want to join each start with the next end that has the same code.
In SQL this is easy to do:
SELECT a.t AS t1,
       (SELECT b.t FROM events AS b
        WHERE b.code = a.code AND b.t > a.t AND b.type = 'B'
        ORDER BY b.t LIMIT 1) AS t2,
       a.code AS code
FROM events AS a
WHERE a.type = 'A'
But I have to implement this in Spark, which does not seem to support this kind of correlated subquery...
What I tried:
df.createOrReplaceTempView("events")
val sqlDF = spark.sql(/* SQL-query above */)
The error I get:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Accessing outer query column is not allowed in:
Do you have any other ideas how I could solve this?
Answer 0 (score: 3)
In SQL it's very easy to do this
Fortunately, so is Spark SQL.
val events = ...
scala> events.show
+----+---+----+
|type|  t|code|
+----+---+----+
|   A| 25|  11|
|   A| 55|  42|
|   B| 88|  11|
|   A|114|  11|
|   B|220|  58|
|   B|520|  11|
+----+---+----+
// assuming t is an integer
scala> events.printSchema
root
|-- type: string (nullable = true)
|-- t: integer (nullable = true)
|-- code: integer (nullable = true)
// Split the events into starts (type A) and ends (type B), and alias
// them so the columns can be disambiguated after the join.
val eventsA = events.
  where($"type" === "A").
  as("a")
val eventsB = events.
  where($"type" === "B").
  as("b")

val solution = eventsA.
  join(eventsB, "code").                                // pair rows sharing the same code
  where($"a.t" < $"b.t").                               // the end must come after the start
  select($"a.t" as "t1", $"b.t" as "t2", $"a.code").
  orderBy($"t1".asc, $"t2".asc).                        // smallest t2 first for every t1
  dropDuplicates("t1", "code").                         // keep one end per start (see note below)
  orderBy($"t1".asc)
Either way, that should give you the requested output.
scala> solution.show
+---+---+----+
| t1| t2|code|
+---+---+----+
| 25| 88|  11|
|114|520|  11|
+---+---+----+