Question

Heyo StackOverflow,

I'm trying to find an elegant way to do a specific transformation.

I have a dataframe of actions that looks like this:

+---------+----------+----------+---------+
|timestamp|   user_id|    action|    value|
+---------+----------+----------+---------+
|      100|         1|     click|     null|
|      101|         2|     click|     null|
|      103|         1|      drag|      AAA|
|      101|         1|     click|     null|
|      108|         1|     click|     null|
|      100|         2|     click|     null|
|      106|         1|      drag|      BBB|
+---------+----------+----------+---------+

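Context: users can perform two actions, click and drag. Clicks have no value; drags do. A drag implies a click, but not the other way around. We also assume that a drag event can be recorded either after or before its click event. So for every drag there is a corresponding click action. What I want to do is merge the drag and click actions into one, i.e. delete the drag row after giving its value to the click row.

To know which click corresponds to which drag, I have to take the click whose timestamp is closest to the drag's timestamp. I also want to make sure that a drag cannot be linked to a click if their timestamps differ by more than 5 (which means some drags might not be linked, and that's fine). And of course, I don't want a drag from user 1 to be matched with a click from user 2.

Here, the result would look like this: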

+---------+----------+----------+---------+
|timestamp|   user_id|    action|    value|
+---------+----------+----------+---------+
|      100|         1|     click|     null|
|      101|         2|     click|     null|
|      101|         1|     click|      AAA|
|      108|         1|     click|      BBB|
|      100|         2|     click|     null|
+---------+----------+----------+---------+

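The drag with AAA (timestamp = 103) is linked to the click at 101, because that is the click closest to 103. Same logic for BBB: the drag at 106 is linked to the click at 108.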
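
To pin down the matching rule described above independently of Spark, here is a minimal plain-Scala sketch over an in-memory list. The names `Event`, `NearestClick`, `MaxGap` and `merge` are hypothetical and only for illustration; it assumes a tie between two equally close clicks goes to the one that appears first in the input, and that if two drags pick the same click the later drag wins. It is a reference for the expected result, not an efficient or distributed implementation.

```scala
// Hypothetical model of one row of the action table (names are illustrative).
case class Event(timestamp: Long, userId: Int, action: String, value: Option[String])

object NearestClick {
  // A drag that is more than 5 away from every click of its user stays unlinked.
  val MaxGap = 5L

  def merge(events: Seq[Event]): Seq[Event] = {
    val (drags, clicks) = events.partition(_.action == "drag")

    // For each drag, pick the same user's closest click within MaxGap, if any.
    // Assumption: if two drags pick the same click, the later one overwrites.
    val assigned: Map[Event, String] = drags.flatMap { d =>
      clicks
        .filter(c => c.userId == d.userId && math.abs(c.timestamp - d.timestamp) <= MaxGap)
        .sortBy(c => math.abs(c.timestamp - d.timestamp)) // stable sort: first click wins ties
        .headOption
        .flatMap(c => d.value.map(v => c -> v))
    }.toMap

    // Keep only the clicks, in input order, carrying the drag value when linked.
    clicks.map(c => assigned.get(c).fold(c)(v => c.copy(value = Some(v))))
  }
}
```

Calling `NearestClick.merge` on the seven sample rows above gives back the five clicks, with AAA on user 1's click at 101 and BBB on user 1's click at 108.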

So I'd like to perform these operations in a smooth/efficient way. So far, I have something like this:

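import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, lead, when}
// Assumes spark.implicits._ is imported so that $"col" resolves to a Column.

// Look at each user's events in timestamp order.
val window = Window.partitionBy($"user_id").orderBy($"timestamp".asc)

myDF
  // Value and timestamp of the event just before and just after each row
  .withColumn("previous_value", lag("value", 1, null) over window)
  .withColumn("previous_timestamp", lag("timestamp", 1, null) over window)
  .withColumn("next_value", lead("value", 1, null) over window)
  .withColumn("next_timestamp", lead("timestamp", 1, null) over window)
  .withColumn("value",
        when(
            // Take the previous event's value when it is a drag...
            $"previous_value".isNotNull and
            // If there is more than 5 sec. difference, it shouldn't be joined
            $"timestamp" - $"previous_timestamp" <= 5 and
            (
                // ...and the next event is missing or farther away than the previous one
                $"next_timestamp".isNull or
                $"next_timestamp" - $"timestamp" > $"timestamp" - $"previous_timestamp"
            ), $"previous_value")
        .otherwise(
            // Otherwise fall back to the next event's value if it is within 5
            when($"next_timestamp" - $"timestamp" <= 5, $"next_value")
            .otherwise(null)
        )
    )
  // Drags are dropped once their value has been copied onto a click
  .filter($"action" === "click")
  .drop("previous_value", "previous_timestamp", "next_value", "next_timestamp")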

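But I feel this is quite inefficient. Is there a better way to do this (without having to create 4 temporary columns...)? Is there a way to handle the rows at offset -1 and +1 in the same expression?

Thanks!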

Answer 1

Here is my attempt using Spark SQL rather than the DataFrame API, but it should be possible to convert:

public class MainClass<T>
{
       protected TestLinkedList<T> testEventLinkedList;
       protected List<TestEvent<T>> E_LinkedList;

       public void execute()
       {
          E_LinkedList = new LinkedList<TestEvent<T>>();

          // there's a loop here to add objects in the linked list 

             E_LinkedList.add(testEvent);

          // end loop

           testEventLinkedList = new TestLinkedList <T>(E_LinkedList);
           testEventLinkedList. listTestEvent.addAll(E_LinkedList);

           while(testEventLinkedList.hasNext())
           {
              TestEvent<T> test = testEventLinkedList.next();
              testEventLinkedList.remove();

              // I need this object here, added on my list sorted by the **spx value**
              // thus, probably, this new object will be added at the end of the list, according to the spx value.

              testEventLinkedList.add (new TestEvent<T>(test.obj, test.p, spx, pos));
           }
      }
}

Tested; it produces the desired output.

Elegantly merging rows in Spark according to multiple conditions
