Question

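My data looks like the following. It has three attributes: location, date, and student_id.

In pandas, I can do

groupby(['location', 'date'])['student_id'].unique()

to see, for each location and date, which students went there to study at the same time.

My question is: how do I do the same groupby in PySpark to extract the same information? Thanks.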

Answer 1

Assuming your data contains rows in the following format:

(location, date, student_id)

you can group the rows by (location, date) and collect the distinct student_id values in each group.
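
One way this could look with the RDD API is sketched below. It is an assumption, not necessarily the code this answer originally contained: the helper names `to_pair`, `merge_students`, and `students_by_location_date` are hypothetical, and the sketch assumes `rdd` is a PySpark RDD of `(location, date, student_id)` tuples. It keys each row by `(location, date)` and unions the per-row sets with `reduceByKey`.

```python
def to_pair(row):
    """Turn (location, date, student_id) into ((location, date), {student_id})."""
    location, date, student_id = row
    return ((location, date), {student_id})


def merge_students(left, right):
    """Union two sets of student ids; repeated ids collapse automatically."""
    return left | right


def students_by_location_date(rdd):
    """Assumes `rdd` is a PySpark RDD of (location, date, student_id) tuples.

    Returns an RDD of ((location, date), set_of_student_ids) pairs.
    """
    return rdd.map(to_pair).reduceByKey(merge_students)
```

With a SparkSession named `spark` (assumed here), `students_by_location_date(spark.sparkContext.parallelize(rows)).collect()` would give one `((location, date), set)` pair per group. The order of the pairs is not guaranteed.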

Answer 2

You can do this in PySpark with collect_set:

 from pyspark.sql import functions as F

 df.groupby('location', 'date').agg(F.collect_set('student_id')).show()

 +--------+----------+-----------------------+
 |location|      date|collect_set(student_id)|
 +--------+----------+-----------------------+
 |   18250|2015-01-04|               [347416]|
 |   18253|2015-01-02|       [167633, 188734]|
 |   18250|2015-01-03|               [363796]|
 +--------+----------+-----------------------+
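
To check what this aggregation computes without starting Spark, the grouping can be modeled in plain Python. This is only an illustrative sketch: the input rows are reconstructed from the output table above (the duplicate row is an added assumption), and `collect_set_model` is a hypothetical helper, not a PySpark function.

```python
from collections import defaultdict


def collect_set_model(rows):
    """Plain-Python model of groupby('location', 'date') + collect_set('student_id').

    `rows` is an iterable of (location, date, student_id) tuples.
    """
    groups = defaultdict(set)
    for location, date, student_id in rows:
        # A set drops repeated student ids within the same group.
        groups[(location, date)].add(student_id)
    # Sort the ids only to make this model's output stable for reading.
    return {key: sorted(ids) for key, ids in groups.items()}


rows = [
    (18250, "2015-01-04", 347416),
    (18253, "2015-01-02", 167633),
    (18253, "2015-01-02", 188734),
    (18253, "2015-01-02", 167633),  # assumed duplicate, to show de-duplication
    (18250, "2015-01-03", 363796),
]

for (location, date), ids in collect_set_model(rows).items():
    print(location, date, ids)
```

Spark's `collect_set` does not guarantee the order of the elements inside each array, so when comparing against pandas it is safer to compare the groups as sets rather than as ordered lists.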

Equivalent of groupby().unique() in PySpark for categorical values
