SQL regular expression to parse text to add new lines

Time: 2013-02-05 22:31:20

Tags: sql regex oracle

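I am trying to take a notes field that arrives as one large block of text. The sample data below shows it as I would insert it into a table.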

create table test_table
(
job_number number,
notes varchar2(4000)
);

insert into test_table (job_number,notes)
values (12345,'1022089483 notes notes notes notes 1022094450 notes notes notes notes 1022095218 notes notes notes notes');

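I need to parse it so that each note entry gets its own record (the 10-digit number at the start of each note is a Unix timestamp). So if I were to export it as pipe-delimited text, it would look like this:

job_number | notes
12345 | 1022089483 notes notes notes
12345 | 1022094450 notes notes notes
12345 | 1022095218 notes notes notes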

I really hope this makes sense. I appreciate any insight.

1 answer:

Answer 0: (score: 0)

There are a few ways to do this. The examples use the following sample data:

SQL> insert into test_table (job_number,notes)
  2  values (12345,'1022089483 notes notes notes notes 1022094450 notes notes notes notes 1022095218 notes notes notes notes');

1 row created.

SQL> insert into test_table (job_number,notes)
  2  values (12346,'1022089483 notes notes notes notes 1022094450 foo 1022095218 test notes 1022493228 the answer is 42');

1 row created.

SQL> commit;

Commit complete.

Note: I am using [0-9]{10} as the regular expression that identifies a note (i.e. any 10-digit number is treated as the start of a note).
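The queries in this answer derive the number of notes per row from that pattern: removing every 10-digit match shortens the string by exactly 10 characters per note. As a minimal sketch, assuming the test_table rows inserted above, the expression can be checked on its own, side by side with REGEXP_COUNT (available from Oracle 11g). The column aliases are illustrative only.

```sql
-- Count the notes in each row in two ways that should agree on the sample data.
select job_number,
       -- each removed 10-digit match shortens the string by 10 characters
       (length(notes) - length(regexp_replace(notes, '[0-9]{10}', null))) / 10 as num_by_length,
       -- REGEXP_COUNT (11g onwards) counts the matches directly
       regexp_count(notes, '[0-9]{10}') as num_by_count
  from test_table
 order by job_number;
```

For the two sample rows both columns should read 3 for job 12345 and 4 for job 12346. One difference to be aware of: if a notes value consisted of nothing but a single timestamp, REGEXP_REPLACE would return NULL and the length-based expression would evaluate to NULL rather than 1.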

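First, we can work out the maximum number of notes in any one row, do a Cartesian join against that many generated rows, and then filter out each note: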

SQL> with data
  2  as (select job_number, notes,
  3            (length(notes)-length(regexp_replace(notes, '[0-9]{10}', null)))/10 num_of_notes
  4        from test_table t)
  5  select job_number,
  6         substr(d.notes, regexp_instr(d.notes, '[0-9]{10}', 1, rn.l),
  7                       regexp_instr(d.notes||' 0000000000', '[0-9]{10}', 1, rn.l+1)
  8                       -regexp_instr(d.notes, '[0-9]{10}', 1, rn.l) -1
  9               ) note
 10    from data d
 11         cross join (select rownum l
 12                      from dual
 13                    connect by level <= (select max(num_of_notes)
 14                                           from data)) rn
 15   where rn.l <= d.num_of_notes
 16   order by job_number, rn.l;

JOB_NUMBER NOTE
---------- --------------------------------------------------
     12345 1022089483 notes notes notes notes
     12345 1022094450 notes notes notes notes
     12345 1022095218 notes notes notes notes
     12346 1022089483 notes notes notes notes
     12346 1022094450 foo
     12346 1022095218 test notes
     12346 1022493228 the answer is 42

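7 rows selected.

This works well as long as the number of notes per row is roughly the same. The more the counts vary, the worse it performs, because the cross join produces max(num_of_notes) candidate rows for every job and the filter then discards the surplus.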
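Two building blocks in the query above can be tried in isolation. The sketch below uses a hard-coded limit of 4 and a made-up two-note string (both are assumptions for illustration, not data from the question). The first statement is the row generator. The second shows how the appended ' 0000000000' sentinel gives the last note a "next timestamp" position to measure against.

```sql
-- Row generator: one row for each integer from 1 to 4.
select rownum as l
  from dual
connect by level <= 4;

-- Extract the 2nd (and last) note of a literal string.
-- The sentinel supplies the 3rd match, so the length arithmetic still works.
with t as (select '1022089483 aaa 1022094450 bbb' as s from dual)
select regexp_instr(s, '[0-9]{10}', 1, 2)                  as note_start,
       regexp_instr(s || ' 0000000000', '[0-9]{10}', 1, 3) as next_start,
       substr(s,
              regexp_instr(s, '[0-9]{10}', 1, 2),
              regexp_instr(s || ' 0000000000', '[0-9]{10}', 1, 3)
              - regexp_instr(s, '[0-9]{10}', 1, 2) - 1)    as note
  from t;
```

Here the second note starts at position 16 and the sentinel at position 31, so the SUBSTR length is 31 - 16 - 1 = 14 and the extracted note is '1022094450 bbb'. The extra -1 drops the space that separates one note from the next timestamp.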

In 11g (Release 2 onwards), we can use recursive subquery factoring to do the same as above without the extra iterations:

SQL> with data (job_number, notes, note, num_of_notes, iter)
  2  as (select job_number, notes,
  3             substr(notes, regexp_instr(notes, '[0-9]{10}', 1, 1),
  4                    regexp_instr(notes||' 0000000000', '[0-9]{10}', 1, 2)
  5                    -regexp_instr(notes, '[0-9]{10}', 1, 1) -1
  6                  ),
  7             (length(notes)-length(regexp_replace(notes, '[0-9]{10}', null)))/10 num_of_notes,
  8             1
  9        from test_table
 10      union all
 11     select job_number, notes,
 12             substr(notes, regexp_instr(notes, '[0-9]{10}', 1, iter+1),
 13                    regexp_instr(notes||' 0000000000', '[0-9]{10}', 1, iter+2)
 14                    -regexp_instr(notes, '[0-9]{10}', 1, iter+1) -1
 15                  ),
 16             num_of_notes, iter + 1
 17       from data
 18      where substr(notes, regexp_instr(notes, '[0-9]{10}', 1, iter+1),
 19                    regexp_instr(notes||' 0000000000', '[0-9]{10}', 1, iter+2)
 20                    -regexp_instr(notes, '[0-9]{10}', 1, iter+1) -1
 21                  ) is not null
 22    )
 23  select job_number, note
 24    from data
 25  order by job_number, iter;

JOB_NUMBER NOTE
---------- --------------------------------------------------
     12345 1022089483 notes notes notes notes
     12345 1022094450 notes notes notes notes
     12345 1022095218 notes notes notes notes
     12346 1022089483 notes notes notes notes
     12346 1022094450 foo
     12346 1022095218 test notes
     12346 1022493228 the answer is 42

7 rows selected.
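The recursive branch stops on its own because of how the two functions behave at the end of the string: REGEXP_INSTR returns 0 when the requested occurrence does not exist, and SUBSTR returns NULL when the computed length is less than 1. A minimal sketch with a made-up single-note string (an assumption for illustration) shows the condition that ends the recursion:

```sql
-- A string with one note: asking for the 2nd note must yield NULL.
with t as (select '1022089483 aaa' as s from dual)
select regexp_instr(s, '[0-9]{10}', 1, 2)                  as second_note_start,
       regexp_instr(s || ' 0000000000', '[0-9]{10}', 1, 3) as third_note_start,
       case
         when substr(s,
                     regexp_instr(s, '[0-9]{10}', 1, 2),
                     regexp_instr(s || ' 0000000000', '[0-9]{10}', 1, 3)
                     - regexp_instr(s, '[0-9]{10}', 1, 2) - 1) is null
         then 'stop'
         else 'continue'
       end                                                 as recursion
  from t;
```

Both positions come back as 0, the length works out to -1, and the CASE expression reports 'stop', which is the same test the "is not null" predicate applies in the recursive member.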

Alternatively, from 10g onwards, we can use the MODEL clause to generate the rows:

SQL> with data as (select job_number, notes,
  2                       (length(notes)-length(regexp_replace(notes, '[0-9]{10}', null)))/10 num_of_notes
  3                  from test_table)
  4  select job_number, note
  5    from data
  6  model
  7  partition by (job_number)
  8  dimension by (1 as i)
  9  measures (notes, num_of_notes, cast(null as varchar2(4000)) note)
 10  rules
 11  (
 12    note[for i from 1 to num_of_notes[1] increment 1]
 13      = substr(notes[1],
 14               regexp_instr(notes[1], '[0-9]{10}', 1, cv(i)),
 15               regexp_instr(notes[1]||' 0000000000', '[0-9]{10}', 1, cv(i)+1)
 16               -regexp_instr(notes[1], '[0-9]{10}', 1, cv(i)) -1
 17              )
 18  )
 19  order by job_number, i;

JOB_NUMBER NOTE
---------- --------------------------------------------------
     12345 1022089483 notes notes notes notes
     12345 1022094450 notes notes notes notes
     12345 1022095218 notes notes notes notes
     12346 1022089483 notes notes notes notes
     12346 1022094450 foo
     12346 1022095218 test notes
     12346 1022493228 the answer is 42
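
The question asks for pipe-delimited output. As a sketch, the cross-join query from earlier in this answer can build each export line by concatenation. The alias export_line is illustrative and not part of the original post, and any of the three approaches could be adapted the same way.

```sql
-- One pipe-delimited line per note, e.g. "12345 | 1022089483 notes notes notes notes".
with data as (
  select job_number, notes,
         (length(notes) - length(regexp_replace(notes, '[0-9]{10}', null))) / 10 as num_of_notes
    from test_table
)
select d.job_number || ' | ' ||
       substr(d.notes,
              regexp_instr(d.notes, '[0-9]{10}', 1, rn.l),
              regexp_instr(d.notes || ' 0000000000', '[0-9]{10}', 1, rn.l + 1)
              - regexp_instr(d.notes, '[0-9]{10}', 1, rn.l) - 1) as export_line
  from data d
       cross join (select rownum as l
                     from dual
                  connect by level <= (select max(num_of_notes) from data)) rn
 where rn.l <= d.num_of_notes
 order by d.job_number, rn.l;
```

The job_number is converted to text implicitly by the concatenation. If a specific number format matters for the export, an explicit TO_CHAR would make that choice visible.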