Question

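I have a table that contains many types of events, and one piece of information carried by some of those events is essential for analyzing the rest. This is my table: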

    name     |player_id|data_ms|coins|progress |
    progress |  1223   | 10    |     |     128 |
    complete |  1223   | 11    |  154|         |
    win      |  1223   | 9     |  111|         |
    progress |  1223   | 11    |     |     129 |
    played   |  1111   | 19    |  141|         |
    progress |  1111   | 25    |     |     225 |

This is the table I want:

    name     |player_id|data_ms|coins|progress |
    progress |  1223   | 10    |     |     128 |
    complete |  1223   | 11    |  154|     128 |
    win      |  1223   | 9     |  111|     129 |
    progress |  1223   | 11    |     |     129 |
    played   |  1111   | 19    |  141|     225 |
    progress |  1111   | 25    |     |     225 |

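I need to look up the player's progress for each event using the following condition: it must come from the first progress event emitted after this event's data_ms (an epoch Unix timestamp).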

My table has 4 billion rows, and it is partitioned by date.

I tried writing a UDF that would read the table and filter it, but that is not an option, because Spark itself cannot be serialized into a UDF.

How can I do this?

Answer 1

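It looks like you want to fill in the gaps in the progress column. I didn't fully understand the condition, but if it is based on data_ms, then your Hive query should look something like this: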
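As a rough sketch of what such a gap-filling query could look like (not necessarily the exact query intended here), the version below uses a window function. It assumes the table is called `events`, which is a hypothetical name, and it reads "first progress after this event" as the first non-null `progress` at or after the row's `data_ms` for the same `player_id`. It follows that stated rule literally, so it may not reproduce every row of the sample output in the question.

```sql
-- Sketch only: `events` is a hypothetical table name.
SELECT
  name,
  player_id,
  data_ms,
  coins,
  -- For each row, scan forward in time within the same player and take
  -- the first non-null progress. The second argument (true) tells
  -- FIRST_VALUE to skip NULLs, so a progress row keeps its own value.
  FIRST_VALUE(progress, true) OVER (
    PARTITION BY player_id
    ORDER BY data_ms
    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
  ) AS progress
FROM events;
```

Because this is plain SQL, it should also run through Spark SQL without a UDF. Two caveats: rows that share the same `data_ms` for a player have no guaranteed order inside the window, so a tie-breaking column in the ORDER BY may be needed, and the window shuffles all of a player's rows together, which is worth keeping in mind on a table of this size.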

How to join data from within the data frame
