I'm trying to do some analysis with Postgres, where I have two tables: predictionstate and pageviews.
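predictionstate table:
This table contains columns holding the results of our algorithms. Its id column has the format {company_identifier}:{user_identifier}.
pageviews table:
This table contains user information.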
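Problem
I'm trying to pull data based on our best model so I can analyze its accuracy. Basically, I need to create segments of the prediction score and count how many members fall into each one. The following code does that: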
WITH ranges AS (
SELECT
myrange::text || '-' || (myrange + 0.1)::text AS segment,
myrange as r_min, myrange + 0.1 as r_max
FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
SPLIT_PART(p.id, ':', 1) as company_identifier,
p.model,
r.segment,
COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users"
FROM
ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;
The problem I'm having, since I don't know exactly how to do it, is that for each (company, model, segment) I need accuracy data, which means querying the pageviews table and identifying the rows where pageview_current_url_type == 'BUYSUCCESS'.
What I tried that didn't work:
WITH ranges AS (
SELECT
myrange::text || '-' || (myrange + 0.1)::text AS segment,
myrange as r_min, myrange + 0.1 as r_max
FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
SPLIT_PART(p.id, ':', 1) as company_identifier,
p.model,
r.segment,
COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
b.n as "converted_users"
FROM
ranges r,
(
SELECT COUNT(DISTINCT(pvs.user_identifier)) as n
FROM pageviews pvs
INNER JOIN (
SELECT
SPLIT_PART(id, ':', 1) as company_identifier,
SPLIT_PART(id, ':', 2) as user_identifier
FROM predictionstate ps
WHERE prediction BETWEEN r.r_min AND r.r_max ) users
ON (
pvs.user_identifier = users.user_identifier AND
pvs.company_identifier= users.company_identifier)
WHERE pageview_current_url_type = 'BUYSUCCESS'
) b
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;
TL;DR: I need to count over a JOIN based on the users from the main query.
Edit:
I added an SQL Fiddle: https://www.db-fiddle.com/f/5sQiZD6mHwdnwvVfvL9MAh/0
What I want to know is, of those segment_users, how many have pageview_current_url_type = 'BUYSUCCESS', adding one more column to the result: segmented_really_bought.
Edit 2: Another attempt that doesn't work (ERROR: column "p.id" must appear in the GROUP BY clause or be used in an aggregate function):
WITH ranges AS (
SELECT
myrange::text || '-' || (myrange + 0.1)::text AS segment,
myrange as r_min, myrange + 0.1 as r_max
FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
SPLIT_PART(p.id, ':', 1) as company_identifier,
p.model,
r.segment,
COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
COUNT(b.*) as "converted_users"
FROM
ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
INNER JOIN (
SELECT users.company_identifier, COUNT(users.user_identifier) AS n
FROM pageviews
INNER JOIN (
SELECT SPLIT_PART(ps.id, ':', 2) AS user_identifier,
SPLIT_PART(ps.id, ':', 1) AS company_identifier
FROM predictionstate ps
WHERE provider_id=47 AND
prediction > 0.7
) users ON (
pageviews.user_identifier=users.user_identifier AND
pageviews.company_identifier=users.company_identifier
)
WHERE pageview_current_url_type='BUYSUCCESS'
GROUP BY users.company_identifier
) AS b
ON (
b.company_identifier = company_identifier
)
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;
Edit 3: Added the desired output.
Generated using the following code: https://gist.github.com/brunoalano/479265b934a67dc02092fb54a846fe1e
Answer 0 (score: 1)
Without sample output it's hard to know exactly what you need, but I think this is what you're looking for:
WITH ranges AS (
SELECT
myrange::text || '-' || (myrange + 0.1)::text AS segment,
myrange as r_min, myrange + 0.1 as r_max
FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
p.company_identifier,
p.model,
r.segment,
COUNT(DISTINCT(p.user_identifier)) as "segment_users",
COUNT(CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS' THEN 1 END) AS segmented_really_bought
FROM
ranges r
INNER JOIN (
SELECT
SPLIT_PART(id, ':', 1) as company_identifier,
SPLIT_PART(id, ':', 2) as user_identifier,
model,
prediction
FROM
predictionstate
) p ON p.prediction BETWEEN r.r_min AND r.r_max
LEFT JOIN pageviews pv ON
p.company_identifier = pv.company_identifier
AND p.user_identifier = pv.user_identifier
GROUP BY p.company_identifier, p.model, r.segment
ORDER BY p.company_identifier, p.model, r.segment;
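Changes to the fiddle query:
- predictionstate is replaced with a subquery that we join to, in which we do the split_part logic to get the company and user identifiers as separate columns
- Added a LEFT JOIN to pageviews
- Added the segmented_really_bought column, a COUNT over a CASE expression
Answer 1 (score: 1)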
WITH ranges AS (
SELECT
myrange::text || '-' || (myrange + 0.1)::text AS segment,
myrange as r_min, myrange + 0.1 as r_max
FROM generate_series(0.0, 0.9, 0.1) AS myrange
), pstate AS ( -- A
SELECT
SPLIT_PART(ps.id, ':', 1) AS company_identifier,
SPLIT_PART(ps.id, ':', 2) AS user_identifier,
model,
prediction
FROM predictionstate ps
)
SELECT
company_identifier, model, segment,
COUNT(DISTINCT user_identifier) as segment_users, -- B
-- C:
COUNT(user_identifier) FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') as really_bought
FROM pstate ps
LEFT JOIN ranges r
ON prediction BETWEEN r_min AND r_max
LEFT JOIN pageviews pv
USING (company_identifier, user_identifier)
GROUP BY company_identifier, model, segment
ORDER BY company_identifier, model, segment
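A: I really recommend splitting the id column into two separate columns so it is easier to work with. That saves a lot of time spent splitting strings (both when writing the query and when executing it) and is more readable. That's why I added the second CTE.
B: COUNT(DISTINCT) counts the distinct users in the group.
C: This counts all joined rows for the users (not distinct users), but keeps only the rows with the expected state before counting.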
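To make the difference between B and C concrete, here is a minimal sketch on made-up rows. The CTE names sample_pstate and sample_pageviews, the identifiers c1, u1 and u2, the URL type 'HOME' and the output column names are assumptions for illustration only; they do not come from the fiddle. If the goal is the number of distinct users who converted, rather than the number of matching pageview rows, DISTINCT can be combined with FILTER:

```sql
-- Hypothetical sample data: user u1 has two BUYSUCCESS pageviews, u2 has none.
WITH sample_pstate (company_identifier, user_identifier) AS (
    VALUES ('c1', 'u1'),
           ('c1', 'u2')
), sample_pageviews (company_identifier, user_identifier, pageview_current_url_type) AS (
    VALUES ('c1', 'u1', 'BUYSUCCESS'),
           ('c1', 'u1', 'BUYSUCCESS'),
           ('c1', 'u2', 'HOME')
)
SELECT
    company_identifier,
    -- B: distinct users in the group
    COUNT(DISTINCT user_identifier) AS segment_users,
    -- C: one count per matching pageview row, so u1 is counted twice
    COUNT(user_identifier)
        FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') AS buy_rows,
    -- Variant: each converting user is counted once
    COUNT(DISTINCT user_identifier)
        FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') AS buyers
FROM sample_pstate
LEFT JOIN sample_pageviews USING (company_identifier, user_identifier)
GROUP BY company_identifier;
```

On these sample rows segment_users is 2, buy_rows is 2 and buyers is 1. Which of the two filtered counts is wanted depends on whether repeat purchases should count more than once.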
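One thing I wonder about: what happens if a prediction lies exactly on a threshold, e.g. 0.3? With the BETWEEN clause it would be counted both in the range 0.2-0.3 and in the range 0.3-0.4 (because BETWEEN is equivalent to r_min <= x <= r_max). It would be better to define the ranges as r_min <= x < r_max or as r_min < x <= r_max. I wrote the join the way you did in your example, but I would expect you to change it; I just don't know in which direction.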
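As a sketch of one of the two options, the join condition below uses half-open ranges, r_min <= x < r_max, so a value sitting exactly on a boundary matches a single segment. The CTE sample_predictions and its values are made up for illustration; in the real query the condition would apply to the prediction column of predictionstate.

```sql
WITH ranges AS (
    SELECT
        myrange::text || '-' || (myrange + 0.1)::text AS segment,
        myrange AS r_min,
        myrange + 0.1 AS r_max
    FROM generate_series(0.0, 0.9, 0.1) AS myrange
), sample_predictions (prediction) AS (
    -- Hypothetical test values; 0.3 sits exactly on a segment boundary.
    VALUES (0.25), (0.3)
)
SELECT s.prediction, r.segment
FROM sample_predictions s
JOIN ranges r
  ON s.prediction >= r.r_min   -- lower bound inclusive
 AND s.prediction <  r.r_max   -- upper bound exclusive
ORDER BY s.prediction;
```

Here 0.25 lands in segment 0.2-0.3 and 0.3 lands only in segment 0.3-0.4, whereas BETWEEN would also match 0.3 against 0.2-0.3. One consequence of this choice is that a prediction of exactly 1.0 matches no segment, so the last range would need an inclusive upper bound if such values can occur.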