Calculate the sum of values found in a main string

Time: 2019-06-27 14:27:19

Tags: sql oracle sas

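I am currently working with medical dosage data. It is a large dataset / Oracle table with a string variable and millions of records. The string variable looks like the sample records below.

I need to find the MG (milligram) doses in this main string and calculate their sum. For example: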

Drug_Direction
(1 JAN) INJECT 2ML (100MG) IV/IM AM THEN 0.5ML (25MG) 20 MIN LATER, THEN 2.5ML (125MG) PM
(SEP 20, 2018) INJECT 0.3ML (30MG) ON S1, 0.6ML (60MG) ON S2 AND 2ML(200MG) ON S3

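The strings are also not in a fixed format. There are variations, for example only 2 or 1 MG doses may be present. In that case I only need those MG doses. I understand that I may need to count the MG occurrences, find the numbers, and sum them. I am currently working on this myself.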

The same data is also available in Oracle, so if there is an easier way to do this in Oracle SQL, that would be welcome too.

2 Answers:

Answer 0: (score: 2)

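This is fairly easy to do in Oracle. You can:

  1. Use REGEXP_COUNT to count the occurrences of MG values in each string
  2. Use CONNECT BY to create one row per match
  3. Use REGEXP_SUBSTR to get each actual match
  4. Convert the strings to numbers and add them up

Something like this: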

WITH test_vals AS (
    SELECT '(1 JAN) INJECT 2ML (100MG) IV/IM AM THEN 0.5ML (25MG) 20 MIN LATER, THEN 2.5ML (125MG) PM' AS drug_direction FROM dual
    UNION ALL SELECT '(SEP 20, 2018) INJECT 0.3ML (30MG) ON S1, 0.6ML (60MG) ON S2 AND 2ML(200MG) ON S3' FROM dual
),

match_rows AS ( /* Get a row for each match */
    SELECT DISTINCT 
           m.drug_direction,
           LEVEL AS mg_occurrance_num
    FROM test_vals m
    CONNECT BY LEVEL <= REGEXP_COUNT(m.drug_direction, '((\d+\.)?\d+)MG') /* Count number of matches in each string */
)

SELECT r.drug_direction,
       SUM(
          TO_NUMBER(
              REGEXP_SUBSTR(
                  r.drug_direction, 
                  '((\d+\.)?\d+)MG', 
                  1, 
                  r.mg_occurrance_num, /* Search for this specific occurrence */
                  '', 
                  1 /* Get first sub-group (the actual numeric value) */
              )
          )
       ) AS total_mg_value
FROM match_rows r
GROUP BY r.drug_direction
ORDER BY r.drug_direction

Note that this assumes all values are in that exact format (a numeric value immediately followed by the string 'MG').
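One possible variation, sketched under assumptions rather than taken from the answer: the CONNECT BY in the query above is not tied to a single source row, so it can generate many duplicate rows that DISTINCT then has to remove, and identical strings stored in different records end up in one group. On a table with millions of records that may become expensive. The sketch below assumes Oracle 12c or later (for CROSS JOIN LATERAL) and a hypothetical key column `id`, which does not appear in the original question.

```sql
WITH test_vals AS (
    /* id is a hypothetical key column added for illustration */
    SELECT 1 AS id, '(1 JAN) INJECT 2ML (100MG) IV/IM AM THEN 0.5ML (25MG) 20 MIN LATER, THEN 2.5ML (125MG) PM' AS drug_direction FROM dual
    UNION ALL SELECT 2, '(SEP 20, 2018) INJECT 0.3ML (30MG) ON S1, 0.6ML (60MG) ON S2 AND 2ML(200MG) ON S3' FROM dual
)
SELECT t.id,
       t.drug_direction,
       SUM(
          TO_NUMBER(
              REGEXP_SUBSTR(t.drug_direction, '((\d+\.)?\d+)MG', 1, n.occ, NULL, 1)
          )
       ) AS total_mg_value
FROM test_vals t
CROSS JOIN LATERAL ( /* One row per match, generated separately for each source row */
    SELECT LEVEL AS occ
    FROM dual
    CONNECT BY LEVEL <= REGEXP_COUNT(t.drug_direction, '((\d+\.)?\d+)MG')
) n
GROUP BY t.id, t.drug_direction
ORDER BY t.id;
```

For the two sample records this should give 250 and 290. A string with no MG value still produces one row from the lateral subquery (LEVEL = 1), so its total comes back as NULL rather than the row disappearing.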

Answer 1: (score: 1)

Assuming it is a single text string, then in Oracle you can use multiple recursive subquery factoring clauses to split the string into substrings:

Oracle Setup

CREATE TABLE table_name ( id, value ) AS
  SELECT 1, '(1 JAN) INJECT 2ML (100MG) IV/IM AM THEN 0.5ML (25MG) 20 MIN LATER, THEN 2.5ML (125MG) PM'
            || '(SEP 20, 2018) INJECT 0.3ML (30MG) ON S1, 0.6ML (60MG) ON S2 AND 2ML(200MG) ON S3' FROM DUAL;

Query

WITH datelines ( id, value, dt, pos, lvl ) AS (
  SELECT id,
         value,
         REGEXP_SUBSTR(
           value,
           '\((([0-2]?\d|3[01]) (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) ([0-2]?\d|3[01]), \d{4})\)',
           1,
           1,
           NULL,
           1
         ),
         REGEXP_INSTR(
           value,
           '\((([0-2]?\d|3[01]) (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) ([0-2]?\d|3[01]), \d{4})\)',
           1,
           1
         ),
         1
  FROM   table_name
UNION ALL
  SELECT id,
         value,
         REGEXP_SUBSTR(
           value,
           '\((([0-2]?\d|3[01]) (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) ([0-2]?\d|3[01]), \d{4})\)',
           1,
           LVL + 1,
           NULL,
           1
         ),
         REGEXP_INSTR(
           value,
           '\((([0-2]?\d|3[01]) (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) ([0-2]?\d|3[01]), \d{4})\)',
           1,
           LVL + 1
         ),
         LVL + 1
  FROM   datelines
  WHERE  REGEXP_SUBSTR(
           value,
           '\((([0-2]?\d|3[01]) (JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)|(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC) ([0-2]?\d|3[01]), \d{4})\)',
           1,
           LVL + 1,
           NULL,
           1
         ) IS NOT NULL
),
actions ( id, dt, lvl, actions ) AS (
  SELECT id,
         dt,
         lvl,
         SUBSTR(
           value,
           pos + LENGTH( dt ) + 2,
           LEAD( pos, 1, LENGTH( value ) + 1 ) OVER ( PARTITION BY id ORDER BY lvl ) - pos - LENGTH( dt ) - 2
         )
  FROM   datelines
),
amounts ( id, dt, lvl, actions, amount, num_amounts, amount_lvl ) AS (
  SELECT id,
         dt,
         lvl,
         actions,
         TO_NUMBER( REGEXP_SUBSTR( actions, '\((\d+)MG\)', 1, 1, NULL, 1 ) ),
         REGEXP_COUNT( actions, '\((\d+)MG\)' ),
         1
  FROM   actions
UNION ALL
  SELECT id,
         dt,
         lvl,
         actions,
         TO_NUMBER( REGEXP_SUBSTR( actions, '\((\d+)MG\)', 1, amount_lvl + 1, NULL, 1 ) ),
         num_amounts,
         amount_lvl + 1
  FROM   amounts
  WHERE  amount_lvl < num_amounts
)
SELECT id,
       dt,
       SUM( amount ) AS total_amount
FROM   amounts
GROUP BY id, dt, lvl;

Output

ID | DT           | TOTAL_AMOUNT
-: | :----------- | -----------:
 1 | SEP 20, 2018 |          290
 1 | 1 JAN        |          250



Update

If each record is in a separate row of the database table, then:

Oracle Setup

CREATE TABLE table_name ( id, value ) AS
  SELECT 1, '(1 JAN) INJECT 2ML (100MG) IV/IM AM THEN 0.5ML (25MG) 20 MIN LATER, THEN 2.5ML (125MG) PM' FROM DUAL UNION ALL
  SELECT 2, '(SEP 20, 2018) INJECT 0.3ML (30MG) ON S1, 0.6ML (60MG) ON S2 AND 2ML(200MG) ON S3' FROM DUAL;

Query

WITH amounts ( id, value, dt, amount, amount_index, num_amounts ) AS (
  SELECT id,
         value,
         REGEXP_SUBSTR( value, '\((.*?)\)', 1, 1, NULL, 1 ),
         TO_NUMBER( REGEXP_SUBSTR( value, '\((\d+)MG\)', 1, 1, NULL, 1 ) ),
         1,
         REGEXP_COUNT( value, '\((\d+)MG\)' )
  FROM   table_name
UNION ALL
  SELECT id,
         value,
         dt,
         TO_NUMBER( REGEXP_SUBSTR( value, '\((\d+)MG\)', 1, amount_index + 1, NULL, 1 ) ),
         amount_index + 1,
         num_amounts
  FROM   amounts
  WHERE  amount_index < num_amounts
)
SELECT id,
       MAX( dt ) AS dt,
       SUM( amount ) AS total_amount
FROM   amounts
GROUP BY id;

Output

ID | DT           | TOTAL_AMOUNT
-: | :----------- | -----------:
 1 | 1 JAN        |          250
 2 | SEP 20, 2018 |          290
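
The pattern `\((\d+)MG\)` used in these queries matches only whole-number doses wrapped in parentheses. If fractional doses can occur, the pattern could be widened as in the sketch below. The sample string, including the fractional dose, is hypothetical and not taken from the original data.

```sql
/* Hypothetical sample string; the dose (12.5MG) is made up for illustration */
WITH sample_vals ( value ) AS (
  SELECT 'INJECT 0.25ML (12.5MG) AM THEN 1ML (50MG) PM' FROM DUAL
)
SELECT REGEXP_COUNT( value, '\((\d+(\.\d+)?)MG\)' ) AS num_amounts,
       /* Sub-group 1 is the numeric part, with an optional decimal fraction */
       REGEXP_SUBSTR( value, '\((\d+(\.\d+)?)MG\)', 1, 1, NULL, 1 ) AS first_amount,
       REGEXP_SUBSTR( value, '\((\d+(\.\d+)?)MG\)', 1, 2, NULL, 1 ) AS second_amount
FROM   sample_vals;
```

This should return 2 matches, with the strings 12.5 and 50. When wrapping these values in TO_NUMBER, keep in mind that the conversion depends on the session's NLS_NUMERIC_CHARACTERS setting, so a session that uses a comma as the decimal separator would need an explicit format model or NLS parameter.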
