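I'm working with pharmacy data and trying to rank the use of three specific drugs (A, B, C) across a large group of patients. In short, I want to work out the top 12 combinations of these drugs that people use. For example, patient 1 might take drugs A + B, patient 2 takes A + C, patient 3 takes B + C, patient 4 takes A + B, and so on. I did some digging, and there are 25 possible combinations to rank. I'd like my output to look like this:
Currently, I'm bucketing the drugs into different combination groups by doing the following: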
select distinct concat(substance_name, dosage, unit) as Drug_Dose_Combo,
count(distinct user_id) as Patients
from pharmacy_data a join drug_reference_table b
on a.drug_code=b.drug_code
group by 1
order by 2 desc
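However, this seems very inefficient, so I'm looking for a better way to structure it. I don't have to use rank() here; I just want the output to look similar to what is outlined above.
Answer 0 (score: 0)
Perhaps something like this (untested):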
WITH meds_taken AS
(SELECT max(CASE WHEN d.drug_name = :namea THEN 1 ELSE 0 END) AS drug_a -- max, not sum: stays a 0/1 flag even when a patient has repeat fills
, max(CASE WHEN d.drug_name = :nameb THEN 1 ELSE 0 END) AS drug_b
, max(CASE WHEN d.drug_name = :namec THEN 1 ELSE 0 END) AS drug_c
FROM pharmacy_data AS p
JOIN drug_reference AS d ON p.drug_code = d.drug_code
GROUP BY p.user_id)
, med_counts AS
(SELECT drug_a, drug_b, drug_c, count(*) AS "user total"
FROM meds_taken
GROUP BY drug_a, drug_b, drug_c)
SELECT rank() OVER (ORDER BY "user total" DESC) AS rank
, drug_a, drug_b, drug_c, "user total"
FROM med_counts
ORDER BY "user total" DESC;
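The idea behind the query above, reduced to its core, is: collapse each patient's rows into the set of target drugs they take, count patients per set, then rank. Here is a minimal Python sketch of that logic, assuming the joined data arrives as (user_id, drug_name) pairs. The function name `rank_combos` and the sample rows are made up for illustration and are not part of the original answer.

```python
from collections import Counter, defaultdict

def rank_combos(rows, targets=("A", "B", "C")):
    """Return (rank, combo, patient_count) tuples, most common combo first."""
    # Collapse repeat fills: each patient maps to the *set* of target drugs taken.
    taken = defaultdict(set)
    for user_id, drug_name in rows:
        if drug_name in targets:
            taken[user_id].add(drug_name)

    # Count patients per combination; sorting the set gives a canonical key.
    counts = Counter(tuple(sorted(drugs)) for drugs in taken.values())

    # Competition ranking, like SQL rank(): ties share a rank, then a gap follows.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked = []
    for position, (combo, n) in enumerate(ordered, start=1):
        rank = ranked[-1][0] if ranked and ranked[-1][2] == n else position
        ranked.append((rank, combo, n))
    return ranked

# Hypothetical sample: patient 1 fills drug A twice but is still one A+B patient.
rows = [(1, "A"), (1, "A"), (1, "B"), (2, "A"), (2, "C"),
        (3, "B"), (3, "C"), (4, "A"), (4, "B")]
result = rank_combos(rows)
```

Using a set per patient plays the same role as a 0/1 flag per drug in SQL: a repeat fill must not turn one combination into two different groups.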
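Answer 1 (score: 0)
Well, it's not entirely clear what you're looking for, but you did indicate that you want to perform some sort of frequency analysis based on combinations of up to three drugs.
The first step in such an analysis is to take the pharmacy data and, for each user_id, determine the set of 1-, 2-, and 3-item drug_dose combinations that user participates in. Since you might want to run the same analysis on substance_name, drug_name, and/or drug_code, I'll throw the kitchen sink at it and do all four. The concepts used here apply to databases such as Oracle, MySQL, and PostgreSQL, although the syntax may differ. Since I don't know which database you use on the back end, I'll use SQL Server 2017 for this example.
To create the drug_code and other combinations, I first join the pharmacy_data table to the drug_reference table, then run a recursive query over the combined data: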
with usage_info as (
select pd.user_id
, dr.drug_code
, dr.drug_name
, dr.substance_name
, concat(dr.substance_name,dr.dosage,dr.unit) drug_dose
from pharmacy_data pd
join drug_reference dr
on dr.drug_code = pd.drug_code
), recur(user_id, combo_id, dc_combo, dc_combo_size, dn_combo, sn_combo, dd_combo, last_dc) as (
-- Anchor part
select user_id
, cast(cast(drug_code as binary(4)) as varbinary(max))
, cast(drug_code as varchar(max))
, 1
, cast(drug_name as varchar(max))
, cast(substance_name as varchar(max))
, cast(drug_dose as varchar(max))
, drug_code
from usage_info
union all
-- Recursive Part
select prev.user_id
, prev.combo_id+cast(curr.drug_code as binary(4))
, prev.dc_combo+','+cast(curr.drug_code as varchar(max))
, prev.dc_combo_size+1
, prev.dn_combo+','+curr.drug_name
, prev.sn_combo+','+curr.substance_name
, prev.dd_combo+','+curr.drug_dose
, curr.drug_code
from recur prev
join usage_info curr
on prev.user_id = curr.user_id
and prev.last_dc < curr.drug_code
and prev.dc_combo_size < 3 -- Maximum combination size
)
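The recursive part above grows each combination by one drug at a time, only ever appending a drug_code larger than the last one, and stops at size 3. That is equivalent to enumerating every 1-, 2-, and 3-element subset of a user's drug codes in ascending order. A minimal Python sketch of the same enumeration follows; the function name `user_combos` and the sample data are hypothetical, and unlike the SQL it first removes duplicate (user_id, drug_code) rows.

```python
from itertools import combinations

def user_combos(usage, max_size=3):
    """Yield (user_id, combo) for every 1..max_size subset of a user's drug codes."""
    # Gather the distinct drug codes per user.
    by_user = {}
    for user_id, drug_code in usage:
        by_user.setdefault(user_id, set()).add(drug_code)

    for user_id, codes in sorted(by_user.items()):
        # Ascending codes mirror the join condition prev.last_dc < curr.drug_code.
        ordered = sorted(codes)
        for size in range(1, max_size + 1):
            for combo in combinations(ordered, size):
                yield user_id, combo

# Hypothetical data: user 1 takes codes 100, 200 and 300; user 2 takes only 100.
usage = [(1, 200), (1, 100), (1, 300), (2, 100)]
combos = list(user_combos(usage))
```

A user with n distinct codes contributes n + C(n,2) + C(n,3) rows, so the output grows quickly for users with many prescriptions.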
Select from the common table expression above, using the data provided in the question:
select * from recur;
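The result shows some irregularities in the groupings of the dn_combo, sn_combo, and possibly dd_combo columns. For example, 'CAZERTA,BEXERA' and 'BEXERA,CAZERTA' should really be treated as equivalent.
To correct this, I'll normalize the combinations by splitting them apart and reassembling them in sorted order. In the process, I'll also remove any instances where a user_id has two or more equivalent but not identical products, such as two different dosages of the same drug:
, combos as (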
select user_id
, combo_id
, dc_combo
, dc_combo_size
, -- Normalize and deduplicate Drug_Name combos
(select string_agg(value,',') within group (order by value)
from (select distinct value from string_split(dn_combo,',')) dn
) dn_combo
, (select count(distinct value) from string_split(dn_combo,',')) dn_combo_size
, -- Normalize and deduplicate Substance_Name combos
(select string_agg(value,',') within group (order by value)
from (select distinct value from string_split(sn_combo,',')) sn
) sn_combo
, (select count(distinct value) from string_split(sn_combo,',')) sn_combo_size
, -- Normalize and deduplicate Drug_Dose combos
(select string_agg(value,',') within group (order by value)
from (select distinct value from string_split(dd_combo,',')) ddc
) dd_combo
, (select count(distinct value) from string_split(dd_combo,',')) dd_combo_size
from recur
)
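The normalization that each string_split / string_agg pair performs can be stated in a few lines: split the combination on commas, drop duplicates, sort, and join again. The Python sketch below illustrates it; the helper name `normalize_combo` is made up, and Python's default string ordering may differ from your SQL Server collation.

```python
def normalize_combo(combo, sep=","):
    """Split, drop duplicates, sort, and rejoin; return (normalized, size)."""
    parts = sorted(set(combo.split(sep)))
    return sep.join(parts), len(parts)

# Both orderings collapse to the same key.
a = normalize_combo("CAZERTA,BEXERA")
b = normalize_combo("BEXERA,CAZERTA")
# Two dosages of the same drug collapse to a single entry.
c = normalize_combo("BEXERA,AXIOM,BEXERA")
```

The returned size corresponds to the *_combo_size columns: it can be smaller than dc_combo_size when two drug codes share a name.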
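Now, although you could select count(user_id) over (partition by <grouping_column>) to get the frequency of each drug combination, those numbers may be inflated. For example, if your data additionally had user_id 999 with drug_codes 50, 100, 200, and 350 (two different dosages of BEXERA, plus AXIOM and CAZERTA), then user_id 999 would show up multiple times for every combination that contains BEXERA. Depending on your database flavor, you might simply select count(DISTINCT user_id) over (partition by <grouping_column>), but as of SQL Server 2017 the DISTINCT operator is not allowed in analytic functions. </shrug>
We can still do it; it just takes one more step to identify the unique values per group. Enter the common table expression combo2, where we calculate row numbers over the various partitions:
, combo2 as (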
select user_id
, combo_id
, dc_combo
, dc_combo_size
, row_number() over (partition by dc_combo, user_id order by dc_combo) dc_uid_rn
, dn_combo
, dn_combo_size
, row_number() over (partition by dn_combo, user_id order by dc_combo) dn_uid_rn
, row_number() over (partition by dn_combo, dc_combo order by user_id) dn_combo_rn
, sn_combo
, sn_combo_size
, row_number() over (partition by sn_combo, user_id order by dc_combo) sn_uid_rn
, row_number() over (partition by sn_combo, dc_combo order by user_id) sn_combo_rn
, dd_combo
, dd_combo_size
, row_number() over (partition by dd_combo, user_id order by dc_combo) dd_uid_rn
, row_number() over (partition by dd_combo, dc_combo order by user_id) dd_combo_rn
from combos
)
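Each row_number() above restarts at 1 within its partition, so the rows where, say, dn_uid_rn = 1 occur exactly once per (dn_combo, user_id) pair, and the rows where dn_combo_rn = 1 occur once per (dn_combo, dc_combo) pair. Counting only those rows is therefore one way to get distinct counts without count(DISTINCT ...) over a window. The Python sketch below shows the same three numbers using sets instead of row numbers. The function name `combo_counts`, the sample rows, and the assignment of codes 50 and 100 to BEXERA and 200 to AXIOM are assumptions for illustration only.

```python
from collections import defaultdict

def combo_counts(rows):
    """rows: (user_id, dc_combo, dn_combo) -> {dn_combo: (naive, uid_cnt, combo_cnt)}."""
    naive = defaultdict(int)         # plain row count, inflated by equivalent products
    users = defaultdict(set)         # distinct user_ids per name combination
    code_combos = defaultdict(set)   # distinct drug_code combinations per name combination
    for user_id, dc_combo, dn_combo in rows:
        naive[dn_combo] += 1
        users[dn_combo].add(user_id)
        code_combos[dn_combo].add(dc_combo)
    return {k: (naive[k], len(users[k]), len(code_combos[k])) for k in naive}

# Hypothetical rows: user 999 holds two BEXERA dosages (assumed codes 50 and 100)
# plus AXIOM (assumed code 200), so appears twice under the same name combination.
rows = [
    (999, "50,200", "AXIOM,BEXERA"),
    (999, "100,200", "AXIOM,BEXERA"),
    (1, "50,200", "AXIOM,BEXERA"),
]
counts = combo_counts(rows)
```

Here the naive count is 3, while only 2 distinct users and 2 distinct drug_code combinations stand behind the 'AXIOM,BEXERA' group.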
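Finally, we calculate our counts, of which there are two types. The uid_cnt columns are the counts of distinct user_id values for each combination, while the combo_cnt columns give the number of distinct drug_code combinations that make up the less precise groupings:
With all of that, plus my additional sample data, the code above produces the following table. To see it in action, see the SQL Fiddle: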