Spark DataFrame groupBy issue

Date: 2016-08-30 20:55:17

Tags: apache-spark apache-spark-sql

I'm running the following expression on a PySpark DataFrame:

md = data.filter(data['cluster_id'].like('cluster30')) \
                .select(
                    udf_make_date(
                        fn.year(data['request_timestamp']),
                        fn.month(data['request_timestamp']),
                        fn.dayofmonth(data['request_timestamp'])
                    ),
                    who_assigned,
                    fn.hour(data['request_timestamp']).alias('request_hour'),
                    fn.date_format(
                        data['request_timestamp'],
                        'F').alias('request_day_of_week'),
                    fn.lit(data.count()).alias('num_requests'),
                    fn.countDistinct(data['user_id']).alias('num_users'),
                    fn.avg(data['microseconds']).alias(
                        'avg_response_time_microseconds')) \
                .groupBy(
                    udf_make_date(
                        fn.year(data['request_timestamp']),
                        fn.month(data['request_timestamp']),
                        fn.dayofmonth(data['request_timestamp'])
                    ),
                    who_assigned,
                    fn.hour(data['request_timestamp']),
                    fn.date_format(
                        data['request_timestamp'],
                        'F')
            )

and I receive the following error:

pyspark.sql.utils.AnalysisException: "expression '`request_timestamp`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;".
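
For reference, here is a minimal sketch (using a toy DataFrame, not my real data) that raises the same kind of AnalysisException: referencing a column that is neither a grouping key nor wrapped in an aggregate.

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
toy = spark.createDataFrame([('a', 1), ('a', 2), ('b', 3)], ['key', 'val'])

# 'val' is neither grouped nor aggregated, so this fails analysis:
# toy.groupBy(toy['key']).agg(toy['val']).show()

# Either aggregate it, or wrap it in first() as the error message suggests:
toy.groupBy(toy['key']).agg(fn.avg(toy['val'])).show()
toy.groupBy(toy['key']).agg(fn.first(toy['val'])).show()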

As far as I understand it, I'm including everything I need in the groupBy... I wrote this to mirror the structure of my SQL query, which looks roughly like this:

SELECT
MAKE_DATE(YEAR(request_timestamp), MONTH(request_timestamp), DAYOFMONTH(request_timestamp)),
CASE
  lots of case logic here...
HOUR(request_timestamp) AS request_hour,
DATE_FORMAT(request_timestamp, 'F') request_day_of_week,
COUNT(*) as num_requests,
COUNT(DISTINCT user_id) num_users,
AVG(microseconds) AS avg_response_time_microseconds
FROM
(SELECT *
FROM {table}
WHERE cluster_id LIKE 'cluster30')
GROUP BY
MAKE_DATE(YEAR(request_timestamp), MONTH(request_timestamp), DAYOFMONTH(request_timestamp)),
CASE
  lots of case logic here...
HOUR(request_timestamp),
DATE_FORMAT(request_timestamp, 'F')
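
Incidentally, since the SQL form already exists, one alternative would be to register the DataFrame as a temporary view and run the query through spark.sql(). A trimmed sketch, assuming Spark 2.x (where createOrReplaceTempView is available) and omitting the CASE logic and the MAKE_DATE call:

data.filter(data['cluster_id'].like('cluster30')) \
    .createOrReplaceTempView('cluster30_requests')

md = spark.sql("""
    SELECT HOUR(request_timestamp)  AS request_hour,
           COUNT(*)                 AS num_requests,
           COUNT(DISTINCT user_id)  AS num_users,
           AVG(microseconds)        AS avg_response_time_microseconds
    FROM cluster30_requests
    GROUP BY HOUR(request_timestamp)
""")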

1 Answer:

Answer (score: 3):

In Spark, groupBy comes before the aggregation, and every column passed to groupBy is automatically selected into the resulting DataFrame; the aggregates themselves go in a separate agg() call after the groupBy. For your query, the equivalent in the Spark DataFrame API would look something like:

# fn refers to pyspark.sql.functions; udf_make_date and who_assigned are the
# UDF and CASE-style column expression defined elsewhere in your code.
from pyspark.sql import functions as fn

data \
    .filter(data['cluster_id'].like('cluster30')) \
    .groupBy(
         udf_make_date(
             fn.year(data['request_timestamp']),
             fn.month(data['request_timestamp']),
             fn.dayofmonth(data['request_timestamp'])
         ).alias('request_date'),
         who_assigned,
         fn.hour(data['request_timestamp']).alias('request_hour'),
         fn.date_format(
             data['request_timestamp'],
             'F'  # note: 'F' is day-of-week-in-month; 'E' would give the day name
         ).alias('request_day_of_week')
    ) \
    .agg(
        fn.countDistinct(data['user_id']).alias('num_users'),
        fn.count('*').alias('num_requests'),
        fn.avg(data['microseconds']).alias('avg_response_time_microseconds')
    )
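
As a usage note: the alias() calls on the grouping expressions carry through to the output schema, so the result can be sorted and inspected by those names. Assuming the chain above is assigned to md:

md.printSchema()  # shows request_date, request_hour, ... plus the aggregate columns
md.orderBy('request_date', 'request_hour').show()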