I am trying to group a Spark DataFrame by week and, for each group, count the unique values of one column:
test.json
{"name":"Yin", "address":1111111, "date":20151122045510}
{"name":"Yin", "address":1111111, "date":20151122045501}
{"name":"Yln", "address":1111111, "date":20151122045500}
{"name":"Yun", "address":1111112, "date":20151122065832}
{"name":"Yan", "address":1111113, "date":20160101003221}
{"name":"Yin", "address":1111111, "date":20160703045231}
{"name":"Yin", "address":1111114, "date":20150419134543}
{"name":"Yen", "address":1111115, "date":20151123174302}
Code:
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime
df_y = sqlContext.read.json("/user/test.json")
# "date" is read from the JSON as an integer, so convert it to a string before parsing
udf_dt = func.udf(lambda x: datetime.strptime(str(x), '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
# group on the parsed timestamp column rather than on the raw integer
df_g = df.groupby(func.hour(df.datetime))
df_g.count().distinct().show()
The result from pyspark is:
df_y.groupby(df_y.name).count().distinct().show()
+----+-----+
|name|count|
+----+-----+
| Yan|    1|
| Yun|    1|
| Yin|    4|
| Yen|    1|
| Yln|    1|
+----+-----+
What I expect is something like this with pandas:
df = df_y.toPandas()
df.groupby('name').address.nunique()
Out[51]:
name
Yan 1
Yen 1
Yin 2
Yln 1
Yun 1
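How can I get the number of unique elements of another field, such as address, for each group?
Answer 0: (score: 32)
You can count the distinct elements in each group with the countDistinct function: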
import pyspark.sql.functions as func
from pyspark.sql.types import TimestampType
from datetime import datetime
df_y = sqlContext.read.json("/user/test.json")
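# "date" is read from the JSON as an integer, so convert it to a string before parsing
udf_dt = func.udf(lambda x: datetime.strptime(str(x), '%Y%m%d%H%M%S'), TimestampType())
df = df_y.withColumn('datetime', udf_dt(df_y.date))
# grouping by hour is kept from the question; it is not used by the aggregation below
df_g = df.groupby(func.hour(df.datetime))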
df_y.groupby(df_y.name).agg(func.countDistinct('address')).show()
+----+--------------+
|name|count(address)|
+----+--------------+
| Yan|             1|
| Yun|             1|
| Yin|             2|
| Yen|             1|
| Yln|             1|
+----+--------------+
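The documentation is available [here](https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html#countDistinct(org.apache.spark.sql.Column,org.apache.spark.sql.Column ...)).
Answer 1: (score: 3)
A short and direct answer that groups by the field "_c1" and counts the number of distinct values in the field "_c2":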
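To make the difference between the two results concrete, here is a minimal plain-Python sketch (no Spark needed) of what a per-group row count and a per-group distinct count compute on the records from test.json. The helper name `count_distinct_per_group` is hypothetical and is not part of PySpark; it only illustrates the semantics, not how Spark implements the aggregation.

```python
from collections import Counter, defaultdict

# Records copied from test.json in the question.
records = [
    {"name": "Yin", "address": 1111111, "date": 20151122045510},
    {"name": "Yin", "address": 1111111, "date": 20151122045501},
    {"name": "Yln", "address": 1111111, "date": 20151122045500},
    {"name": "Yun", "address": 1111112, "date": 20151122065832},
    {"name": "Yan", "address": 1111113, "date": 20160101003221},
    {"name": "Yin", "address": 1111111, "date": 20160703045231},
    {"name": "Yin", "address": 1111114, "date": 20150419134543},
    {"name": "Yen", "address": 1111115, "date": 20151123174302},
]


def count_distinct_per_group(rows, key, value):
    """Hypothetical helper: number of distinct `value` entries for each `key`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[value])
    return {group: len(values) for group, values in seen.items()}


# Rows per name: this is what groupby(...).count() reports.
row_counts = Counter(row["name"] for row in records)

# Distinct addresses per name: this is what countDistinct('address') reports.
distinct_addresses = count_distinct_per_group(records, "name", "address")

print(row_counts["Yin"], distinct_addresses["Yin"])
```

Calling `.distinct()` after `.count()` only removes duplicate rows from the already aggregated result, which is why the question's output still shows 4 for Yin.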
import pyspark.sql.functions as F
dg = df.groupBy("_c1").agg(F.countDistinct("_c2"))
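The question set out to group by a date-derived field rather than by name. As a rough plain-Python sketch of that idea, assuming `date` is an integer in yyyyMMddHHmmss form (as in test.json) and assuming "week" means the ISO calendar week, the same distinct count can be keyed on (ISO year, ISO week). The names `parse_date` and `distinct_addresses_per_week` are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict
from datetime import datetime

# (address, date) pairs copied from test.json in the question.
rows = [
    (1111111, 20151122045510),
    (1111111, 20151122045501),
    (1111111, 20151122045500),
    (1111112, 20151122065832),
    (1111113, 20160101003221),
    (1111111, 20160703045231),
    (1111114, 20150419134543),
    (1111115, 20151123174302),
]


def parse_date(raw):
    """Hypothetical helper: the date is an integer, so stringify it before parsing."""
    return datetime.strptime(str(raw), "%Y%m%d%H%M%S")


def distinct_addresses_per_week(pairs):
    """Hypothetical helper: distinct addresses per (ISO year, ISO week)."""
    groups = defaultdict(set)
    for address, raw_date in pairs:
        iso = parse_date(raw_date).isocalendar()
        # iso[0] is the ISO year, iso[1] is the ISO week number
        groups[(iso[0], iso[1])].add(address)
    return {week: len(addresses) for week, addresses in groups.items()}


per_week = distinct_addresses_per_week(rows)
for week in sorted(per_week):
    print(week, per_week[week])
```

Keying on the ISO year as well as the week keeps weeks from different years apart; note that 2016-01-01 falls in ISO week 53 of 2015.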