Ranking different combinations in SQL

Date: 2019-05-30 21:44:46

Tags: sql apache-zeppelin

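I'm working with pharmacy data and trying to rank the use of three specific drugs (A, B, C) across a large population of patients. In short, I want to find the top 12 combinations of these drugs that people are taking. For example, patient 1 might take drugs A + B, patient 2 takes A + C, patient 3 takes B + C, patient 4 takes A + B, and so on. I did some digging and there are 25 possible combinations to rank. I'd like my output to look something like this:

[image: desired output, not available]

The tables I'm working with look like this:

[image: source tables, not available]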

Currently, I'm separating the drugs into different combination groups by doing the following:

select distinct concat(substance_name, dosage, unit) as Drug_Dose_Combo,
count(distinct user_id) as Patients 
from pharmacy_data a join drug_reference_table b 
on a.drug_code=b.drug_code 
group by 1 
order by 2 desc

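However, this seems very inefficient, so I'm looking for a better way to structure it. I don't have to use rank() here; I just want the output to look similar to what's outlined above.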

2 answers:

Answer 0: (score: 0)

Maybe something like this (untested):

WITH meds_taken AS
  (SELECT sum(CASE WHEN d.drug_name = :namea THEN 1 ELSE 0 END) AS drug_a
        , sum(CASE WHEN d.drug_name = :nameb THEN 1 ELSE 0 END) AS drug_b
        , sum(CASE WHEN d.drug_name = :namec THEN 1 ELSE 0 END) AS drug_c
   FROM pharmacy_data AS p
   JOIN drug_reference AS d ON p.drug_code = d.drug_code
   GROUP BY p.user_id)
, med_counts AS
  (SELECT drug_a, drug_b, drug_c, count(*) AS "user total"
   FROM meds_taken
   GROUP BY drug_a, drug_b, drug_c)
SELECT rank() OVER (ORDER BY "user total" DESC) AS rank
     , drug_a, drug_b, drug_c, "user total"
FROM med_counts
ORDER BY "user total" DESC;
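To try the idea on a toy data set, here is a minimal sketch that runs a close variant of the query above through Python's built-in sqlite3 module. The table contents are made up, SQLite stands in for whatever database is actually in use, and two details are my own substitutions rather than the answer's: max() replaces sum() so that a patient who fills the same drug twice still lands in the same group, and the "user total" alias is renamed user_total to avoid quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE drug_reference (drug_code INTEGER PRIMARY KEY, drug_name TEXT);
CREATE TABLE pharmacy_data (user_id INTEGER, drug_code INTEGER);
INSERT INTO drug_reference VALUES (1, 'A'), (2, 'B'), (3, 'C');
-- Made-up fills: users 1 and 4 take A+B, user 2 takes A+C, user 3 takes B+C.
INSERT INTO pharmacy_data VALUES
  (1, 1), (1, 2), (1, 1),   -- user 1 filled drug A twice
  (2, 1), (2, 3),
  (3, 2), (3, 3),
  (4, 1), (4, 2);
""")

query = """
WITH meds_taken AS
  (SELECT p.user_id
        -- max() yields a 0/1 presence flag even when a drug was filled repeatedly
        , max(CASE WHEN d.drug_name = :namea THEN 1 ELSE 0 END) AS drug_a
        , max(CASE WHEN d.drug_name = :nameb THEN 1 ELSE 0 END) AS drug_b
        , max(CASE WHEN d.drug_name = :namec THEN 1 ELSE 0 END) AS drug_c
   FROM pharmacy_data AS p
   JOIN drug_reference AS d ON p.drug_code = d.drug_code
   GROUP BY p.user_id)
, med_counts AS
  (SELECT drug_a, drug_b, drug_c, count(*) AS user_total
   FROM meds_taken
   GROUP BY drug_a, drug_b, drug_c)
SELECT rank() OVER (ORDER BY user_total DESC) AS rnk
     , drug_a, drug_b, drug_c, user_total
FROM med_counts
ORDER BY user_total DESC, drug_a DESC, drug_b DESC, drug_c DESC;
"""

rows = conn.execute(query, {"namea": "A", "nameb": "B", "namec": "C"}).fetchall()
for row in rows:
    print(row)  # (rank, drug_a, drug_b, drug_c, user_total)
```

With this sample data the A+B combination ranks first with two patients, and A+C and B+C tie at rank 2 with one patient each. Adding LIMIT 12 to the final select would be one way to keep only the top 12 combinations.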

Answer 1: (score: 0)

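Well, it's not entirely clear what you're looking for, but you did indicate that you want to perform some kind of frequency analysis based on combinations of up to three drugs.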

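The first step in such an analysis is to take the pharmacy data and determine, for each user_id, the set of 1-, 2-, and 3-drug_dose combinations they participate in. Since you may want to run the same analysis on substance_name, drug_name, and/or drug_code, I'll throw the kitchen sink at it and do all four. The concepts used apply to databases such as Oracle, MySQL, and PostgreSQL, though the syntax may differ; since I don't know which database you're using on the back end, I'll use SQL Server 2017 for this example.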

To create the drug_code and other combinations, I first join the pharmacy_data table to the drug_reference table, and then use a recursive query on the combined data:

with usage_info as (
  select pd.user_id
       , dr.drug_code
       , dr.drug_name
       , dr.substance_name
       , concat(dr.substance_name,dr.dosage,dr.unit) drug_dose
    from pharmacy_data pd
    join drug_reference dr
      on dr.drug_code = pd.drug_code
), recur(user_id, combo_id, dc_combo, dc_combo_size, dn_combo, sn_combo, dd_combo, last_dc) as (
  -- Anchor part
  select user_id
       , cast(cast(drug_code as binary(4)) as varbinary(max))
       , cast(drug_code as varchar(max))
       , 1
       , cast(drug_name as varchar(max))
       , cast(substance_name as varchar(max))
       , cast(drug_dose as varchar(max))
       , drug_code
    from usage_info

  union all
  -- Recursive Part
  select prev.user_id
       , prev.combo_id+cast(curr.drug_code as binary(4))
       , prev.dc_combo+','+cast(curr.drug_code as varchar(max))
       , prev.dc_combo_size+1
       , prev.dn_combo+','+curr.drug_name
       , prev.sn_combo+','+curr.substance_name
       , prev.dd_combo+','+curr.drug_dose
       , curr.drug_code
    from recur prev
    join usage_info curr
      on prev.user_id = curr.user_id
     and prev.last_dc < curr.drug_code
     and prev.dc_combo_size < 3 -- Maximum combination size
)
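The recursive pattern above can be tried on a small scale. The sketch below is a cut-down version (drug_code and drug_name combos only) run through Python's built-in sqlite3 module, so it uses SQLite syntax (WITH RECURSIVE, || for concatenation) rather than SQL Server's. The drug codes and their mapping to names are invented for illustration, and the sketch assumes each (user_id, drug_code) pair appears only once in usage_info.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage_info (user_id INTEGER, drug_code INTEGER, drug_name TEXT);
-- Hypothetical rows: codes 10 and 30 stand for two dosages of CAZERTA.
INSERT INTO usage_info VALUES
  (1, 10, 'CAZERTA'), (1, 20, 'BEXERA'),
  (2, 20, 'BEXERA'),  (2, 30, 'CAZERTA'), (2, 40, 'AXIOM');
""")

query = """
WITH RECURSIVE recur(user_id, dc_combo, dc_combo_size, dn_combo, last_dc) AS (
  -- Anchor part: every single drug is a combination of size 1
  SELECT user_id, CAST(drug_code AS TEXT), 1, drug_name, drug_code
    FROM usage_info
  UNION ALL
  -- Recursive part: extend a combo with a higher drug_code of the same user,
  -- so each set of codes is generated exactly once
  SELECT prev.user_id
       , prev.dc_combo || ',' || curr.drug_code
       , prev.dc_combo_size + 1
       , prev.dn_combo || ',' || curr.drug_name
       , curr.drug_code
    FROM recur AS prev
    JOIN usage_info AS curr
      ON prev.user_id = curr.user_id
     AND prev.last_dc < curr.drug_code
     AND prev.dc_combo_size < 3   -- maximum combination size
)
SELECT user_id, dc_combo, dn_combo
  FROM recur
 ORDER BY user_id, dc_combo_size, dc_combo;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

User 1 yields 3 combinations and user 2 yields 7. Because combos are built in drug_code order, the name strings do not come out alphabetically: user 1's pair is 'CAZERTA,BEXERA' while user 2's is 'BEXERA,CAZERTA'.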

Selecting from the above common table expression with the data provided in the question:

select * from recur;

shows some irregularities in the groupings of the dn_combo and sn_combo columns, and possibly dd_combo: for example, 'CAZERTA,BEXERA' and 'BEXERA,CAZERTA' should really be equivalent.
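To make "equivalent" concrete, here is a tiny sketch in plain Python: two combo strings are treated as the same combination when they contain the same set of members, regardless of order or repeats. The helper name normalize_combo is hypothetical and used only for illustration.

```python
def normalize_combo(combo: str, sep: str = ",") -> str:
    """Split a combo string, drop duplicate members, sort, and rejoin."""
    members = {part.strip() for part in combo.split(sep) if part.strip()}
    return sep.join(sorted(members))


print(normalize_combo("CAZERTA,BEXERA"))       # BEXERA,CAZERTA
print(normalize_combo("BEXERA,CAZERTA"))       # BEXERA,CAZERTA
print(normalize_combo("BEXERA,BEXERA,AXIOM"))  # AXIOM,BEXERA
```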

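To correct this, I'll normalize the combinations by splitting them apart and recombining them in sorted order. In the process I'll also remove any instances where a user_id might have two or more equivalent but non-identical products, such as two different dosages of the same drug: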

, combos as (
  select user_id
       , combo_id
       , dc_combo
       , dc_combo_size
       -- Normalize and deduplicate Drug_Name combos
       , (select string_agg(value,',') within group (order by value)
            from (select distinct value
                    from string_split(dn_combo,',')) dn
         ) dn_combo
       , (select count(distinct value)
            from string_split(dn_combo,',')) dn_combo_size
       -- Normalize and deduplicate Substance_Name combos
       , (select string_agg(value,',') within group (order by value)
            from (select distinct value
                    from string_split(sn_combo,',')) sn
         ) sn_combo
       , (select count(distinct value)
            from string_split(sn_combo,',')) sn_combo_size
       -- Normalize and deduplicate Drug_Dose combos
       , (select string_agg(value,',') within group (order by value)
            from (select distinct value
                    from string_split(dd_combo,',')) ddc
         ) dd_combo
       , (select count(distinct value)
            from string_split(dd_combo,',')) dd_combo_size
    from recur
)

Now, although you could select count(user_id) over (partition by <grouping_column>) to get the frequency of each drug combination, those numbers could be inflated. For example, if your data additionally had user_id 999 with drug_codes 50, 100, 200, and 350 (that's two different dosages of BEXERA along with AXIOM and CAZERTA), then user_id 999 would show up multiple times for each combination containing BEXERA. Depending on your database flavor you could just select count(DISTINCT user_id) over (partition by <grouping_column>); however, as of SQL Server 2017 the distinct operator is not allowed in analytic functions. </shrug> We can still do it, it just takes another step to identify the unique values for each group. Enter common table expression combo2, where we'll calculate row numbers for the various partitions:

, combo2 as (
  select user_id
       , combo_id
       , dc_combo
       , dc_combo_size
       , row_number() over (partition by dc_combo, user_id order by dc_combo) dc_uid_rn
       , dn_combo
       , dn_combo_size
       , row_number() over (partition by dn_combo, user_id order by dc_combo) dn_uid_rn
       , row_number() over (partition by dn_combo, dc_combo order by user_id) dn_combo_rn
       , sn_combo
       , sn_combo_size
       , row_number() over (partition by sn_combo, user_id order by dc_combo) sn_uid_rn
       , row_number() over (partition by sn_combo, dc_combo order by user_id) sn_combo_rn
       , dd_combo
       , dd_combo_size
       , row_number() over (partition by dd_combo, user_id order by dc_combo) dd_uid_rn
       , row_number() over (partition by dd_combo, dc_combo order by user_id) dd_combo_rn
    from combos
)

And then finally we calculate our counts, of which there are two types. The uid_cnt columns are the distinct counts of user_id for each combination, while the combo_cnt columns give the number of distinct drug_code combinations that make up the less precise groupings:

[final query not available]
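One way the row numbers from combo2 could be turned into distinct counts is to count only the rows whose row number is 1 within each partition. The sketch below illustrates that idea for the dn_combo columns only, using Python's built-in sqlite3 module and a plain GROUP BY in the last step; it is a guess at the technique under those assumptions and not necessarily how the answer's own final query was written. The sample rows, including the mapping of drug_codes 50 and 100 to BEXERA and 200 to CAZERTA, are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE combos (user_id INTEGER, dc_combo TEXT, dn_combo TEXT);
-- Hypothetical rows: user 999 takes two dosages of BEXERA (codes 50 and 100)
-- plus CAZERTA (code 200), so the same name combo appears twice for that user.
INSERT INTO combos VALUES
  (1,   '50,200',  'BEXERA,CAZERTA'),
  (999, '50,200',  'BEXERA,CAZERTA'),
  (999, '100,200', 'BEXERA,CAZERTA'),
  (2,   '300',     'AXIOM');
""")

query = """
WITH combo2 AS (
  SELECT user_id, dc_combo, dn_combo
       , row_number() OVER (PARTITION BY dn_combo, user_id ORDER BY dc_combo) AS dn_uid_rn
       , row_number() OVER (PARTITION BY dn_combo, dc_combo ORDER BY user_id) AS dn_combo_rn
    FROM combos
)
SELECT dn_combo
     , count(*) AS naive_cnt   -- inflated: counts user 999 twice
     , sum(CASE WHEN dn_uid_rn = 1 THEN 1 ELSE 0 END) AS dn_uid_cnt     -- distinct users
     , sum(CASE WHEN dn_combo_rn = 1 THEN 1 ELSE 0 END) AS dn_combo_cnt -- distinct code combos
  FROM combo2
 GROUP BY dn_combo
 ORDER BY dn_uid_cnt DESC, dn_combo;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # (dn_combo, naive_cnt, dn_uid_cnt, dn_combo_cnt)
```

For 'BEXERA,CAZERTA' the naive count is 3, while the row-number flags give 2 distinct users and 2 distinct drug_code combinations, which is the inflation the answer warns about.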

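With all of that and my additional sample data, the code above produces the following table [table not available]. To see it in action, see the SQL Fiddle [link not available].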