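I have a database of paragraph documents. I want to split the paragraphs in the table "master_data" into individual sentences and store them in a different table, "splittext".
master_data table:
id | Title | Paragraph
splittext table:
id_sen | sentences | doc_id
I tried to select each sentence in master_data.Paragraph with this query: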
SELECT Paragraph FROM pyproject.master_data where REGEXP_SUBSTR '[^\.\!\*
[\.\!\?]';
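But it produces a bracket error. So I tried wrapping the pattern in parentheses, and that produces an incorrect parameter count error: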
SELECT Paragraph FROM pyproject.master_data where REGEXP_SUBSTR '([^\.\!\*
[\.\!\?])';
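My expected result is that each paragraph is split into sentences and stored in the new table, with the paragraph's original id stored in doc_id.
For example: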
master_data:
id | Title | Paragraph |
1 | asds..| I want. Some. Coconut and Banana !! |
2 | wad...| Milkshake? some Nice milk. |
splittext_table:
id| sentences | doc_id |
1| I want | 1 |
2| Some | 1 |
.
.
.
5| Some Nice milk | 2 |
Answer 0 (score: 1)
For MySQL 8.0, you can use a recursive CTE, bearing in mind its limitations.
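with recursive r as (
    -- Anchor: the first sentence of every paragraph.
    -- The cast fixes the column type used by the recursive rows.
    select
        1 id,
        cast(regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)'
        ) as char(256)) sentences,
        id doc_id, Title, Paragraph
    from master_data
    union all
    -- Recursive step: fetch occurrence id + 1 of the pattern, until none is left.
    select
        id + 1,
        regexp_substr(
            Paragraph, '[^.!?]+(?:[.!?]+|$)',
            1, id + 1
        ),
        doc_id, Title, Paragraph
    from r
    where sentences is not null
)
-- Keep every sentence found, plus one row for paragraphs with no match.
select id, sentences, doc_id, Title
from r
where sentences is not null or id = 1
order by doc_id, id;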
Output:
| id | sentences | doc_id | Title |
+----+-----------------------+--------+--------+
| 1 | I want. | 1 | asds.. |
| 2 | Some. | 1 | asds.. |
| 3 | Coconut and Banana !! | 1 | asds.. |
| 1 | Milkshake? | 2 | wad... |
| 2 | some Nice milk. | 2 | wad... |
| 1 | bar | 3 | foo |
Demo on DB Fiddle.
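The query above only selects the sentences; the question also asks for them to be stored in splittext with the original id in doc_id. If the splitting can happen in application code instead of in SQL, the same pattern can be applied there. The following is only a minimal sketch under several assumptions: it uses Python's re module rather than MySQL's regexp_substr, an in-memory SQLite database stands in for the real MySQL server, and the helper names (split_sentences, store_sentences, demo_connection) are made up for illustration. Unlike the query, it trims the whitespace around each sentence.

```python
import re
import sqlite3

# Same pattern as in the query: a run of characters that are not . ! ?
# followed by one or more terminators, or by the end of the text.
SENTENCE = re.compile(r"[^.!?]+(?:[.!?]+|$)")


def split_sentences(paragraph):
    """Return the sentences of a paragraph, with surrounding whitespace trimmed."""
    pieces = (m.group().strip() for m in SENTENCE.finditer(paragraph or ""))
    return [p for p in pieces if p]  # drop whitespace-only matches


def store_sentences(conn):
    """Fill splittext with one row per sentence, keeping the source id in doc_id."""
    docs = conn.execute("SELECT id, Paragraph FROM master_data ORDER BY id").fetchall()
    for doc_id, paragraph in docs:
        conn.executemany(
            "INSERT INTO splittext (sentences, doc_id) VALUES (?, ?)",
            [(s, doc_id) for s in split_sentences(paragraph)],
        )
    conn.commit()


def demo_connection():
    """Hypothetical in-memory database holding the sample rows from the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE master_data (id INTEGER PRIMARY KEY, Title TEXT, Paragraph TEXT)")
    conn.execute(
        "CREATE TABLE splittext ("
        "id_sen INTEGER PRIMARY KEY AUTOINCREMENT, sentences TEXT, doc_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO master_data VALUES (?, ?, ?)",
        [(1, "asds..", "I want. Some. Coconut and Banana !!"),
         (2, "wad...", "Milkshake? some Nice milk.")],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    store_sentences(conn)
    for row in conn.execute("SELECT id_sen, sentences, doc_id FROM splittext ORDER BY id_sen"):
        print(row)
```

Here id_sen is left to the database to number, so the sentences of all paragraphs get one running sequence, as in the expected splittext table in the question. Against a real MySQL server the connection and the placeholder style would differ depending on the driver used.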