Can't get the aggregate sum function to count elements correctly

Time: 2016-06-30 16:31:32

Tags: apache-spark pyspark

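My job parses HTTP log requests. The last statement looks at a field called controller_type to see whether it is LIKE certain patterns, and then checks whether controller_context_id isNotNull. If both hold, it assigns a value of 1, otherwise 0, and then builds a sum column from these 1s and 0s. The problem is that my job counts rows whenever they meet the isNotNull criterion, without really paying attention to the controller_type part. Do I have a logic or syntax error, or am I doing something wrong in how I built this expression?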


from pyspark.sql import functions as fn

df = df.groupby(
    fn.trunc(df['request_timestamp'], 'mon').alias('request_timestamp'),
    df['account_id'],
    df['account_guid'],
    df['cluster_id'],
    df['shard_id'],
    df['unique_id'],
    df['context_id'],
    df['controller_type'],
    df['controller_context_id'],
    df['concat_user_id'],
    df['user_id']) \
    .agg(
        fn.count(df['account_id']).alias('num_page_views'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('pages%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_pages'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('files%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_files'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('modules%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_modules'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('assignments%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_assignments'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('quizzes%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_quizzes'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('discussion_topics%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_discussion_topics'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('outcome%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_outcomes'),
        fn.countDistinct(df['user_id']).alias('num_distinct_user_logins'),
        fn.countDistinct(df['session_id']).alias('num_sessions')
    )

This is the equivalent SQL statement:

SELECT
            TRUNC(request_timestamp, 'month') AS request_timestamp,
            account_id,
            account_guid,
            cluster_id,
            shard_id,
            unique_id,
            context_id,
            controller_type,
            controller_context_id,
            concat_user_id,
            user_id,
            COUNT(account_id) AS num_page_views,
            SUM(CASE
                    WHEN controller_type LIKE 'pages%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_pages,
            SUM(CASE
                    WHEN controller_type LIKE 'files%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_files,
            SUM(CASE
                    WHEN controller_type LIKE 'modules%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_modules,
            SUM(CASE
                    WHEN controller_type LIKE 'assignments%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_assignments,
            SUM(CASE
                    WHEN controller_type LIKE 'quizzes%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_quizzes,
            SUM(CASE
                    WHEN controller_type LIKE 'discussion_topics%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_discussion_topics,
            SUM(CASE
                    WHEN controller_type LIKE 'outcome%' AND
                         controller_context_id <> '' AND
                         controller_context_id IS NOT NULL
                    THEN 1
                    ELSE 0 END) AS num_page_views_outcomes,
            COUNT(DISTINCT session_id) AS num_sessions
        FROM requests
        GROUP BY
          TRUNC(request_timestamp, 'month'),
          account_id,
          account_guid,
          cluster_id,
          shard_id,
          unique_id,
          context_id,
          context_id,
          controller_type,
          controller_context_id,
          concat_user_id,
          user_id

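I think I am missing something, because when I try a small toy problem, it does not seem to aggregate correctly: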

from pyspark.sql.functions import sum, when, lit

df = sqlContext.createDataFrame(
    [('something', 'null', 'something'),
     ('null', 'something', 'something'),
     ('something', 'something', 'something')],
    ['a', 'b', 'c'])

# alias() belongs on the aggregated sum, not on the when() expression inside it
df.groupby(df.a, df.b, df.c).agg(
    sum(when(df.a.isNotNull(), lit(1)).otherwise(lit(0))).alias('sum_col')
).show()

1 Answer:

Answer 0: (score: 1)

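Your approach doesn't work on the toy data because the string "null" IS NOT NULL, so it is never filtered out. If you want to check whether a field contains the string "null", use equality (==) instead. Let's illustrate this with a simple example: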

from pyspark.sql import functions as fn

df = sc.parallelize([
    (1, "null"),   # the string "null"
    (2, None),     # a real NULL
    (3, "foo")
]).toDF(["id", "x"])


df.select("*",
    fn.col("x").isNull(),    # check if value IS NULL  - OK
    fn.col("x") == "null",   # check if value = 'null' - not valid here 
    fn.col("x") == None      # check if value = NULL   - WRONG - always NULL!
    ## fn.col("x") is None   # Check if column is None - WRONG!
).show()

## +---+----+---------+----------+----------+
## | id|   x|isnull(x)|(x = null)|(x = null)|
## +---+----+---------+----------+----------+
## |  1|null|    false|      true|      null|   # string = "null" but is NOT NULL
## |  2|null|     true|      null|      null|   # NULL IS NULL, but != 'null'
## |  3| foo|    false|     false|      null|   # not null
## +---+----+---------+----------+----------+
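The same distinction explains the toy aggregation in the question. Below is a minimal sketch that does not need a Spark cluster: it uses Python's built-in sqlite3 module as a stand-in, on the assumption that Spark SQL treats IS NOT NULL and SUM(CASE ...) the same way for this case. The helper name count_not_null and the table name t are hypothetical and only serve the illustration.

```python
import sqlite3


def count_not_null(rows):
    """Return SUM(CASE WHEN a IS NOT NULL THEN 1 ELSE 0 END) over the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a TEXT, b TEXT, c TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    (total,) = conn.execute(
        "SELECT SUM(CASE WHEN a IS NOT NULL THEN 1 ELSE 0 END) FROM t"
    ).fetchone()
    conn.close()
    return total


# The toy data from the question: 'null' is an ordinary string here.
string_rows = [
    ("something", "null", "something"),
    ("null", "something", "something"),
    ("something", "something", "something"),
]

# The same data with real missing values (Python None is stored as SQL NULL).
none_rows = [
    tuple(None if value == "null" else value for value in row)
    for row in string_rows
]

print(count_not_null(string_rows))  # 3: every 'a' is a non-NULL string
print(count_not_null(none_rows))    # 2: the row whose 'a' is NULL is skipped
```

With the string "null" every row passes the IS NOT NULL check, which is why the sum in the question looks like a plain row count.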

In addition, you can easily simplify all of the conditions to:

checks = [
    ('pages%', 'num_page_views_pages'),
    ('quizzes%', 'num_page_views_quizzes'),
    ...
]

def count_like(pattern, label):
    cond = (
        fn.col('controller_type').like(pattern) &
        fn.col('controller_context_id').isNotNull()
    )

    # count() counts only NOT NULL values, so we can omit otherwise()
    # and use an arbitrary value in when(). The alias goes on the
    # aggregate itself, not on the when() expression inside it.
    return fn.count(fn.when(cond, 1)).alias(label)

(df
    .groupBy(...)
    .agg(*[count_like(p, l) for p, l in checks]))
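
The comment in count_like relies on COUNT skipping NULLs, so a CASE without ELSE gives the same result as SUM over 1s and 0s. A small sketch of that equivalence follows, again using Python's built-in sqlite3 as a stand-in for Spark SQL (an assumption). The sample rows are made up for illustration and are not taken from the real logs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (controller_type TEXT, controller_context_id TEXT)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        ("pages/show", "42"),    # matches the pattern, id present
        ("pages/index", None),   # matches the pattern, id is NULL
        ("quizzes/show", "7"),   # id present, pattern does not match
        ("pages/edit", "43"),    # matches the pattern, id present
    ],
)

sum_style, count_style = conn.execute(
    """
    SELECT
        SUM(CASE WHEN controller_type LIKE 'pages%'
                  AND controller_context_id IS NOT NULL
                 THEN 1 ELSE 0 END),
        COUNT(CASE WHEN controller_type LIKE 'pages%'
                    AND controller_context_id IS NOT NULL
                   THEN 1 END)
    FROM requests
    """
).fetchone()
conn.close()

print(sum_style, count_style)  # 2 2
```

Both forms count only the rows that satisfy the LIKE pattern and have a non-NULL controller_context_id.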