Spark DataFrame groupBy and distinct not yielding the expected results
I expect... I'm not sure whether characters in my strings are causing the problem.
scala> df.select(df("ID")).distinct.show(false)
2018-05-25 17:17:29 WARN TaskSetManager:66 - Stage 68 contains a task of very large size (556 KB). The maximum recommended task size is 100 KB.
+------------------------------------+
|ID                                  |
+------------------------------------+
|2445b371-7cec-41b0-947a-8a04c4e8cbbb|
|db33c6d4-26fb-42c8-99a3-1bdfc2bf4612|
+------------------------------------+
This makes no sense, because these strings are not in the dataset:
scala> df.select(df("ID")).show(false)
2018-05-25 17:17:50 WARN TaskSetManager:66 - Stage 79 contains a task of very large size (556 KB). The maximum recommended task size is 100 KB.
+------------------------------------+
|ID                                  |
+------------------------------------+
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
|80a9d91d-c66f-4bf7-89a0-acf9ccd8e1b8|
+------------------------------------+
only showing top 20 rows
The column type is a plain string:
scala> df.select(df("ID")).printSchema
root
|-- ID: string (nullable = true)