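My job parses HTTP log requests. The last statement looks at a field named controller_type to see whether it is like some pattern, and then checks whether controller_context_id isNotNull. If so, it assigns a value of 1, otherwise 0, and then creates columns holding the sums of these 1s and 0s. The problem is that my job counts rows whenever they meet the isNotNull criterion, without really paying attention to the controller_type part. Do I have a logic or syntax error, or am I constructing this expression the wrong way?
df = df.groupby(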
        fn.trunc(df['request_timestamp'], 'mon').alias(
            'request_timestamp'),
        df['account_id'],
        df['account_guid'],
        df['cluster_id'],
        df['shard_id'],
        df['unique_id'],
        df['context_id'],
        df['controller_type'],
        df['controller_context_id'],
        df['concat_user_id'],
        df['user_id']) \
    .agg(
        fn.count(df['account_id']).alias('num_page_views'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('pages%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_pages'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('files%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_files'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('modules%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_modules'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('assignments%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_assignments'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('quizzes%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_quizzes'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('discussion_topics%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_discussion_topics'),
        fn.sum(
            fn.when(
                ((df['controller_type'].like('outcome%')) &
                 (df['controller_context_id'].isNotNull())),
                fn.lit(1))
            .otherwise(fn.lit(0))
        ).alias('num_page_views_outcomes'),
        fn.countDistinct(df['user_id']).alias('num_distinct_user_logins'),
        fn.countDistinct(df['session_id']).alias('num_sessions')
    )
This is the equivalent SQL statement:
SELECT
    TRUNC(request_timestamp, 'month') AS request_timestamp,
    account_id,
    account_guid,
    cluster_id,
    shard_id,
    unique_id,
    context_id,
    controller_type,
    controller_context_id,
    concat_user_id,
    user_id,
    COUNT(account_id) AS num_page_views,
    SUM(CASE
        WHEN controller_type LIKE 'pages%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_pages,
    SUM(CASE
        WHEN controller_type LIKE 'files%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_files,
    SUM(CASE
        WHEN controller_type LIKE 'modules%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_modules,
    SUM(CASE
        WHEN controller_type LIKE 'assignments%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_assignments,
    SUM(CASE
        WHEN controller_type LIKE 'quizzes%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_quizzes,
    SUM(CASE
        WHEN controller_type LIKE 'discussion_topics%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_discussion_topics,
    SUM(CASE
        WHEN controller_type LIKE 'outcome%' AND
             controller_context_id <> '' AND
             controller_context_id IS NOT NULL
        THEN 1
        ELSE 0 END) AS num_page_views_outcomes,
    COUNT(DISTINCT session_id) AS num_sessions
FROM requests
GROUP BY
    TRUNC(request_timestamp, 'month'),
    account_id,
    account_guid,
    cluster_id,
    shard_id,
    unique_id,
    context_id,
    controller_type,
    controller_context_id,
    concat_user_id,
    user_id
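I think I am missing something, because when I try a small toy problem it does not seem to aggregate properly: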
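from pyspark.sql import functions as fn

df = sqlContext.createDataFrame(
    [('something', 'null', 'something'), ('null', 'something', 'something'), ('something', 'something', 'something')],
    ['a', 'b', 'c'])
# The alias is applied to the aggregate itself, not to the when() expression inside it
df.groupby(df.a, df.b, df.c).agg(
    fn.sum(fn.when(df.a.isNotNull(), fn.lit(1)).otherwise(fn.lit(0))).alias('sum_col')).show()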
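Answer 0 (score: 1)
Your approach does not work on the toy data because the string "null" IS NOT NULL, so isNotNull cannot filter it out. If you want to check whether a field contains the string "null", use equality (==). Let's illustrate with a simple example: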
df = sc.parallelize([
    (1, "null", ),
    (2, None, ),
    (3, "foo", )
]).toDF(["id", "x"])
df.select("*",
    fn.col("x").isNull(),    # check if value IS NULL - OK
    fn.col("x") == "null",   # check if value = 'null' - not valid here
    fn.col("x") == None      # check if value = NULL - WRONG - always NULL!
    ## fn.col("x") is None   # Check if column is None - WRONG!
).show()
## +---+----+---------+----------+----------+
## | id|   x|isnull(x)|(x = null)|(x = null)|
## +---+----+---------+----------+----------+
## |  1|null|    false|      true|      null|  # string = "null" but is NOT NULL
## |  2|null|     true|      null|      null|  # NULL IS NULL, but != 'null'
## |  3| foo|    false|     false|      null|  # not null
## +---+----+---------+----------+----------+
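If it helps to see the rule without a Spark session, the three checks in the table can be modelled in plain Python. This is only a rough sketch of SQL's three-valued logic, not Spark's implementation; the helper names sql_is_null and sql_equals are made up for illustration, and Python's None stands in for SQL NULL.

```python
from typing import Optional


def sql_is_null(value: Optional[str]) -> bool:
    """Model of IS NULL: always yields a real boolean."""
    return value is None


def sql_equals(left: Optional[str], right: Optional[str]) -> Optional[bool]:
    """Model of SQL '=': the result is NULL (None) if either operand is NULL."""
    if left is None or right is None:
        return None
    return left == right


# Same three rows as the example above: a string "null", a real NULL, and "foo"
rows = [(1, "null"), (2, None), (3, "foo")]

# Columns: id, x IS NULL, x = 'null', x = NULL
table = [
    (row_id, sql_is_null(x), sql_equals(x, "null"), sql_equals(x, None))
    for row_id, x in rows
]

for line in table:
    print(line)
```

The last column is None for every row, which is why comparing against None with == can never select anything, while the first column is the only one that detects a real NULL.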
Moreover, you can easily simplify all the conditions to:
checks = [
    ('pages%', 'num_page_views_pages'),
    ('quizzes%', 'num_page_views_quizzes'),
    ...
]
def count_like(pattern, label):
    cond = (
        fn.col('controller_type').like(pattern) &
        fn.col('controller_context_id').isNotNull()
    )
    # Count will count only NOT NULL. We can omit otherwise
    # and choose arbitrary value
    # The alias goes on the aggregate so the output column gets the label
    return fn.count(fn.when(cond, 1)).alias(label)
(df
    .groupBy(...)
    .agg(*[count_like(p, l) for p, l in checks]))
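To see why count over when without otherwise gives the same number as the original sum of 1s and 0s, here is a small plain-Python model of the two aggregates. It is a sketch under assumptions: matches, sql_count and sql_sum are hypothetical helpers that only imitate how SQL aggregates skip NULL, LIKE 'pages%' is approximated with startswith, and the sample rows are invented.

```python
def matches(row, prefix):
    """Model of: controller_type LIKE 'prefix%' AND controller_context_id IS NOT NULL."""
    controller_type, controller_context_id = row
    # LIKE on a NULL value yields NULL, which WHEN treats as "not matched"
    return (
        controller_type is not None
        and controller_type.startswith(prefix)
        and controller_context_id is not None
    )


def sql_count(values):
    """Model of COUNT(expr): counts only non-NULL values."""
    return sum(1 for v in values if v is not None)


def sql_sum(values):
    """Model of SUM(expr): skips NULLs, and is NULL when nothing is left."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# (controller_type, controller_context_id); invented sample rows
rows = [
    ("pages/show", "42"),
    ("pages/index", None),   # real NULL: not counted
    ("files/show", "7"),     # wrong prefix: not counted
    (None, "9"),             # NULL controller_type: not counted
    ("pages/edit", "null"),  # the string "null" is NOT NULL, so it is counted
]

# sum(when(cond, 1).otherwise(0))
via_sum = sql_sum([1 if matches(r, "pages") else 0 for r in rows])
# count(when(cond, 1)): unmatched rows become NULL and are skipped
via_count = sql_count([1 if matches(r, "pages") else None for r in rows])

print(via_sum, via_count)
```

Both variants agree on this data. Note that the row holding the string "null" is still counted, which is the same effect seen in the toy example.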