I am converting R code to SQL.
My R code is as follows:
temp <- questions %>%
select(UserId, ResultIndicator, ThemeId) %>%
filter(UserId == 72) %>%
group_by(ThemeId, ResultIndicator) %>%
arrange(desc(ResultIndicator)) %>%
summarise(Nominal = n()) %>%
mutate(Percent = Nominal/sum(Nominal)) %>%
mutate(Percent = round(Percent, 3) * 100) %>%
mutate(diff = Percent - lag(Percent, default = first(Percent)))
The output is as follows:
structure(list(ThemeId = c(11L, 11L, 12L, 12L, 13L, 19L), ResultIndicator = c("Correct",
"Wrong", "Correct", "Wrong", "Correct", "Wrong"), Nominal = c(34L,
4L, 25L, 2L, 10L, 1L), Percent = c(89.5, 10.5, 92.6, 7.4, 100,
100), diff = c(0, -79, 0, -85.2, 0, 0)), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), vars = "ThemeId", labels = structure(list(
ThemeId = c(11L, 12L, 13L, 19L)), class = "data.frame", row.names = c(NA,
-4L), vars = "ThemeId", labels = structure(list(ThemeId = c(11L,
12L, 13L, 19L, 22L, 33L, 35L, 38L, 48L, 56L, 59L, 62L, 71L, 77L
)), row.names = c(NA, -14L), class = "data.frame", vars = "ThemeId", drop = TRUE), indices = list(
0:1, 2:3, 4L, 5L, 6:7, 8:9, 10:11, 12L, 13:14, 15:16, 17:18,
19:20, 21:22, 23:24), drop = TRUE, group_sizes = c(2L, 2L,
1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), biggest_group_size = 2L), indices = list(
0:1, 2:3, 4L, 5L), drop = TRUE, group_sizes = c(2L, 2L, 1L,
1L), biggest_group_size = 2L)
For non-R users, the above translates to:
  ThemeId ResultIndicator Nominal Percent  diff
1      11         Correct      34    89.5   0.0
2      11           Wrong       4    10.5 -79.0
3      12         Correct      25    92.6   0.0
4      12           Wrong       2     7.4 -85.2
5      13         Correct      10   100.0   0.0
6      19           Wrong       1   100.0   0.0
My attempt in SQL is:
SELECT Count(Id) as Nominal, ResultIndicator, ThemeId
FROM LogUserQuestions
WHERE UserId = 72
GROUP BY ThemeId, ResultIndicator
ORDER BY ThemeId
But I don't know how to compute the lag. I tried:
(Nominal - lag(Nominal) over (partition by [not sure] order by [not sure])) as diff
But I can't use Nominal, because it is created later.
Any hints?
Answer 0 (score: 2)
I think it is something like this:
SELECT ResultIndicator, ThemeId, COUNT(*) as Nominal,
COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS [Percent],
COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY ResultIndicator) as diff
FROM LogUserQuestions
WHERE UserId = 72
GROUP BY ThemeId, ResultIndicator
ORDER BY ResultIndicator DESC;
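To line this up with the R output in the question, note that after summarise() dplyr drops the last grouping level, so sum(Nominal) and lag() there run per ThemeId. One possible variation adds PARTITION BY ThemeId to each window. This is only a sketch: it assumes SQL Server-style syntax (the [Percent] bracket quoting) and the LogUserQuestions table and columns from the question. Because a window function cannot be nested inside another window function, the difference in Percent is written as the difference in counts divided by the per-theme total. It is rounded after subtracting, so it can differ in the last digit from R's diff, which subtracts already-rounded values.

```sql
-- Sketch only: table and column names come from the question;
-- the per-theme windows are an assumption about the intended result.
SELECT ThemeId,
       ResultIndicator,
       COUNT(*) AS Nominal,
       -- this row's count as a percentage of all rows in the same ThemeId
       ROUND(100.0 * COUNT(*)
             / SUM(COUNT(*)) OVER (PARTITION BY ThemeId), 1) AS [Percent],
       -- change in percentage points from the previous row of the same ThemeId;
       -- the third LAG argument makes the first row of each theme come out as 0
       ROUND(100.0 * (COUNT(*)
                      - LAG(COUNT(*), 1, COUNT(*))
                            OVER (PARTITION BY ThemeId ORDER BY ResultIndicator))
             / SUM(COUNT(*)) OVER (PARTITION BY ThemeId), 1) AS diff
FROM LogUserQuestions
WHERE UserId = 72
GROUP BY ThemeId, ResultIndicator
ORDER BY ThemeId, ResultIndicator;
```

Ordering the LAG window by ResultIndicator ascending puts Correct before Wrong within a theme, which is the row order the R output shows.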
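Answer 1 (score: 2)
Actually, contrary to your assumption that "I can't use Nominal, because it is created later", you can: compute the aggregate in a CTE and then reference that field in a later query. In fact, you may want more than one CTE, since the computed Percent is itself derived from the earlier aggregate Nominal. As @GordonLinoff shows, everything can run in a single query, but readability may become an issue.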
WITH agg AS (
    SELECT ThemeId, ResultIndicator, COUNT(Id) AS Nominal
    FROM LogUserQuestions
    WHERE UserId = 72
    GROUP BY ThemeId, ResultIndicator
), pct AS (
    SELECT ResultIndicator, ThemeId, Nominal,
           -- 100.0 forces decimal division; PARTITION BY ThemeId gives the share within each theme
           100.0 * Nominal / SUM(Nominal) OVER (PARTITION BY ThemeId) AS [Percent]
    FROM agg
)
SELECT ResultIndicator, ThemeId, Nominal, ROUND([Percent], 1) AS [Percent],
       -- LAG() takes three arguments: expression, offset, default
       -- ascending order compares Wrong against Correct, as in the R output
       ([Percent] - LAG([Percent], 1, [Percent])
            OVER (PARTITION BY ThemeId ORDER BY ResultIndicator)) AS diff
FROM pct
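Answer 2 (score: 1)
I don't have your data in a SQL table, so this is a bit difficult. What I usually do in this situation is build a unique key, wrap the whole result in a subquery, and then select from it. Something like this: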
SELECT Nominal,
ResultIndicator,
ThemeId,
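-- partition by ThemeId so lag() can see the other rows of the same theme;
-- myKey is unique per row, so partitioning by it would always return NULL
(Nominal - lag(Nominal) over (partition by ThemeId order by myKey)) as diff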
FROM
(
SELECT Count(Id) as Nominal, ResultIndicator, ThemeId, ResultIndicator + CAST(ThemeId as varchar(50)) as myKey
FROM LogUserQuestions
WHERE UserId = 72
GROUP BY ThemeId, ResultIndicator
-- no ORDER BY here: SQL Server rejects ORDER BY in a derived table unless TOP or OFFSET is used
) sub
order by Nominal