I'm using a PySpark DataFrame and running the following expression:
md = data.filter(data['cluster_id'].like('cluster30')) \
    .select(
        udf_make_date(
            fn.year(data['request_timestamp']),
            fn.month(data['request_timestamp']),
            fn.dayofmonth(data['request_timestamp'])
        ),
        who_assigned,
        fn.hour(data['request_timestamp']).alias('request_hour'),
        fn.date_format(
            data['request_timestamp'],
            'F').alias('request_day_of_week'),
        fn.lit(data.count()).alias('num_requests'),
        fn.countDistinct(data['user_id']).alias('num_users'),
        fn.avg(data['microseconds']).alias(
            'avg_response_time_microseconds')) \
    .groupBy(
        udf_make_date(
            fn.year(data['request_timestamp']),
            fn.month(data['request_timestamp']),
            fn.dayofmonth(data['request_timestamp'])
        ),
        who_assigned,
        fn.hour(data['request_timestamp']),
        fn.date_format(
            data['request_timestamp'],
            'F')
    )
and I get the following error:
pyspark.sql.utils.AnalysisException: "expression '`request_timestamp`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;".
As far as I can tell, I should be including everything I need in the groupBy... I'm writing this to mirror the structure of my SQL query, which looks roughly like this:
SELECT
    MAKE_DATE(YEAR(request_timestamp), MONTH(request_timestamp), DAYOFMONTH(request_timestamp)),
    CASE
        lots of case logic here...
    HOUR(request_timestamp) AS request_hour,
    DATE_FORMAT(request_timestamp, 'F') request_day_of_week,
    COUNT(*) AS num_requests,
    COUNT(DISTINCT user_id) num_users,
    AVG(microseconds) AS avg_response_time_microseconds
FROM
    (SELECT *
     FROM {table}
     WHERE cluster_id LIKE 'cluster30')
GROUP BY
    MAKE_DATE(YEAR(request_timestamp), MONTH(request_timestamp), DAYOFMONTH(request_timestamp)),
    CASE
        lots of case logic here...
    HOUR(request_timestamp),
    DATE_FORMAT(request_timestamp, 'F')
Answer (score: 3)
In Spark, the groupBy comes before the aggregation. Additionally, every column listed in the groupBy is selected into the resulting DataFrame. For your query, the equivalent in the Spark SQL API would be something like:
data \
    .filter(data['cluster_id'].like('cluster30')) \
    .groupBy(
        udf_make_date(
            fn.year(data['request_timestamp']),
            fn.month(data['request_timestamp']),
            fn.dayofmonth(data['request_timestamp'])
        ).alias('request_date'),
        who_assigned,
        fn.hour(data['request_timestamp']).alias('request_hour'),
        fn.date_format(
            data['request_timestamp'],
            'F'
        ).alias('request_day_of_week')
    ) \
    .agg(
        fn.countDistinct(data['user_id']).alias('num_users'),
        fn.count('*').alias('num_requests'),
        fn.avg(data['microseconds']).alias('avg_response_time_microseconds')
    )