Postgres: counting rows with a LEFT JOIN

Time: 2018-10-01 20:19:21

Tags: sql postgresql

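I'm trying to do some analysis with Postgres, where I have 2 tables: predictionstate and pageviews.

predictionstate table:

This table contains columns with the results of our algorithm, using this structure:

  • id ({company_identifier}:{user_identifier})
  • model (reference string value)
  • prediction (float between 0.0 and 1.0)

pageviews table:

This table contains user information, using this structure:

  • company_identifier
  • user_identifier
  • pageview_current_url_type

Problem

I'm trying to get data based on our best model in order to analyze its accuracy. Basically, I need to create segments by prediction range and count how many members each segment has. The following code does that: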

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users"
FROM
  ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

The problem is that I don't know exactly how to do the next step: for each (company, model, segment) I need accuracy data, which means querying the pageviews table and identifying rows where pageview_current_url_type == 'BUYSUCCESS'.

What I tried that didn't work:

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
  b.n as "converted_users"
FROM
  ranges r,
  (
    SELECT COUNT(DISTINCT(pvs.user_identifier)) as n
    FROM pageviews pvs
    INNER JOIN (
      SELECT
        SPLIT_PART(id, ':', 1) as company_identifier,
        SPLIT_PART(id, ':', 2) as user_identifier
      FROM predictionstate ps
      WHERE prediction BETWEEN r.r_min AND r.r_max
    ) users ON (
      pvs.user_identifier = users.user_identifier AND
      pvs.company_identifier = users.company_identifier
    )
    WHERE pageview_current_url_type = 'BUYSUCCESS'
  ) b
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

TL;DR: I need to compute a count through a JOIN, based on the users from the main query.

Edit:

I added an SQL Fiddle: https://www.db-fiddle.com/f/5sQiZD6mHwdnwvVfvL9MAh/0

What I want to know is, for those segment_users, how many have pageview_current_url_type = 'BUYSUCCESS', adding one more column to the result: segmented_really_bought

Edit 2: Another attempt that doesn't work (ERROR: column "p.id" must appear in the GROUP BY clause or be used in an aggregate function)

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  SPLIT_PART(p.id, ':', 1) as company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(SPLIT_PART(p.id, ':', 2))) as "segment_users",
  COUNT(b.*) as "converted_users"
FROM
  ranges r
INNER JOIN predictionstate p ON p.prediction BETWEEN r.r_min AND r.r_max
INNER JOIN (
  SELECT users.company_identifier, COUNT(users.user_identifier) AS n
  FROM pageviews
  INNER JOIN (
    SELECT SPLIT_PART(ps.id, ':', 2) AS user_identifier,
           SPLIT_PART(ps.id, ':', 1) AS company_identifier
    FROM predictionstate ps
    WHERE provider_id=47 AND
          prediction > 0.7
   ) users ON (
      pageviews.user_identifier=users.user_identifier AND
      pageviews.company_identifier=users.company_identifier
    )
  WHERE pageview_current_url_type='BUYSUCCESS'
  GROUP BY users.company_identifier
) AS b
ON (
  b.company_identifier = company_identifier
)
GROUP BY company_identifier, p.model, r.segment
ORDER BY company_identifier, p.model, r.segment;

Edit 3: Added the desired output

Generated using the following code: https://gist.github.com/brunoalano/479265b934a67dc02092fb54a846fe1e

2 Answers:

Answer 0: (score: 1)

Without sample output it's hard to know exactly what you need, but I think this is what you are looking for:

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
)
SELECT
  p.company_identifier,
  p.model,
  r.segment,
  COUNT(DISTINCT(p.user_identifier)) as "segment_users",
  COUNT(CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS' THEN 1 END) AS segmented_really_bought
FROM
  ranges r
INNER JOIN (
  SELECT
    SPLIT_PART(id, ':', 1) as company_identifier,
    SPLIT_PART(id, ':', 2) as user_identifier,
    model,
    prediction
  FROM
    predictionstate
  ) p ON p.prediction BETWEEN r.r_min AND r.r_max
LEFT JOIN pageviews pv ON 
  p.company_identifier = pv.company_identifier
  AND p.user_identifier = pv.user_identifier
GROUP BY p.company_identifier, p.model, r.segment
ORDER BY p.company_identifier, p.model, r.segment;

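Changes to the fiddle query:

  • Replaced predictionstate with a subquery that we join on, where we apply the split_part logic to get the company and user identifiers as separate columns
  • Used those identifiers to LEFT JOIN pageviews
  • Added a COUNT over a CASE expression for the segmented_really_bought column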
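
One thing worth checking in the query above: the LEFT JOIN produces one row per matching pageview, so COUNT(CASE ...) counts BUYSUCCESS pageviews rather than distinct buyers. The sketch below uses made-up sample rows (the inline p and pv data and the column aliases buysuccess_pageviews and buying_users are assumptions for illustration, not the real tables) to compare the two counts.

```sql
-- Illustrative data only: user u1 has two BUYSUCCESS pageviews, u2 has none.
WITH p(company_identifier, user_identifier) AS (
  VALUES ('c1', 'u1'), ('c1', 'u2')
), pv(company_identifier, user_identifier, pageview_current_url_type) AS (
  VALUES ('c1', 'u1', 'BUYSUCCESS'),
         ('c1', 'u1', 'BUYSUCCESS'),
         ('c1', 'u1', 'HOME')
)
SELECT
  p.company_identifier,
  COUNT(DISTINCT p.user_identifier) AS segment_users,
  -- counts joined rows, i.e. one per BUYSUCCESS pageview
  COUNT(CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS' THEN 1 END)
    AS buysuccess_pageviews,
  -- counts each buying user once, however many BUYSUCCESS pageviews they have
  COUNT(DISTINCT CASE WHEN pv.pageview_current_url_type = 'BUYSUCCESS'
                      THEN p.user_identifier END) AS buying_users
FROM p
LEFT JOIN pv
  ON p.company_identifier = pv.company_identifier
 AND p.user_identifier = pv.user_identifier
GROUP BY p.company_identifier;
```

With this sample data the row count is 2 while the distinct-user count is 1. Which of the two is wanted depends on whether segmented_really_bought is meant to count purchases or purchasers.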

Answer 1: (score: 1)

demo: db<>fiddle

WITH ranges AS (
  SELECT
    myrange::text || '-' || (myrange + 0.1)::text AS segment,
    myrange as r_min, myrange + 0.1 as r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
), pstate AS (                                         -- A
  SELECT 
    SPLIT_PART(ps.id, ':', 1) AS company_identifier,
    SPLIT_PART(ps.id, ':', 2) AS user_identifier,
    model,
    prediction
  FROM predictionstate ps
)
SELECT 
  company_identifier, model, segment,
  COUNT(DISTINCT user_identifier) as segment_users,    -- B
  -- C: 
  COUNT(user_identifier) FILTER (WHERE pageview_current_url_type = 'BUYSUCCESS') as really_bought
FROM pstate ps
LEFT JOIN ranges r 
ON prediction BETWEEN r_min AND r_max
LEFT JOIN pageviews pv 
USING (company_identifier, user_identifier)
GROUP BY company_identifier, model, segment
ORDER BY company_identifier, model, segment

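A: I really recommend splitting the id column into two columns so it is easier to work with. It saves a lot of time spent splitting strings (both when writing the query and when executing it), and it is more readable. That is why I added the second CTE.

B: COUNT(DISTINCT) counts the distinct users in the group.

C: This counts all users (not unique ones), but filters on the expected state before counting.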
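
Returning to point A, one possible way to split the id column permanently is sketched below. It assumes the predictionstate table from the question; the new column names, the text type, and the index name are assumptions, and the statements should be tried on a copy of the data first.

```sql
-- Hypothetical migration: column names, types and index name are assumptions.
ALTER TABLE predictionstate
  ADD COLUMN company_identifier text,
  ADD COLUMN user_identifier text;

-- Populate the new columns once from the combined id.
UPDATE predictionstate
SET company_identifier = SPLIT_PART(id, ':', 1),
    user_identifier    = SPLIT_PART(id, ':', 2);

-- Optional: index the join keys used against pageviews.
CREATE INDEX predictionstate_company_user_idx
  ON predictionstate (company_identifier, user_identifier);
```

After a change like this the pstate CTE would no longer be needed, and the join to pageviews could use the two columns directly.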


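I am wondering: what happens if a prediction is exactly on a threshold, e.g. 0.3? With the BETWEEN clause it is counted in both the 0.2-0.3 range and the 0.3-0.4 range (because BETWEEN is equivalent to r_min <= x <= r_max). It would be better to define the ranges as r_min <= x < r_max or r_min < x <= r_max. I kept the join as you wrote it in your example, but I would change it. I still don't know in which direction, though.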
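
A minimal sketch of the half-open variant (r_min <= x < r_max), using a few made-up prediction values instead of the real table. The sample CTE and the special case that keeps exactly 1.0 in the last bucket are assumptions; without that extra condition a prediction of 1.0 would match no range at all.

```sql
WITH ranges AS (
  SELECT myrange AS r_min, myrange + 0.1 AS r_max
  FROM generate_series(0.0, 0.9, 0.1) AS myrange
), sample(prediction) AS (
  -- illustrative values: one on a threshold, one inside a range, one at the top
  VALUES (0.3::numeric), (0.35), (1.0)
)
SELECT s.prediction, r.r_min, r.r_max
FROM sample s
JOIN ranges r
  ON s.prediction >= r.r_min
 AND (s.prediction < r.r_max
      -- assumption: a prediction of exactly 1.0 belongs to the last bucket
      OR (r.r_max = 1.0 AND s.prediction = 1.0))
ORDER BY s.prediction;
```

Each sample value matches exactly one range here, whereas BETWEEN would return 0.3 twice. The sample uses numeric values; with a float prediction column, values that sit right on a boundary may need extra care because of floating-point representation.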