Question

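I have a database of paragraph documents. I want to split the paragraphs in the table "master_data" into individual sentences and store them in a different table, "splittext".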

master_data table:

id | Title | Paragraph

splittext table:

id_sen | sentences | doc_id

I tried this query to select each sentence in the Paragraph column of master_data:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '[^\.\!\* 
[\.\!\?]';

But it produces a bracket error. So I tried wrapping the pattern in parentheses, which produces an incorrect parameter count error:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '([^\.\!\* 
[\.\!\?])';

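My expected result is that each paragraph is split into sentences that are stored in the new table, with the paragraph's original id stored in doc_id.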

For example:

master_data:

id | Title | Paragraph  |
 1 | asds..| I want. Some. Coconut and Banana !! |
 2 | wad...| Milkshake? some Nice milk.          |

splittext_table:

id| sentences | doc_id  |

 1|   I want   |    1    |
 2|   Some     |    1    |
           .
           .
           . 
 5| Some Nice milk |   2   |

Answer 1

With MySQL 8.0 you can use a recursive CTE, keeping its limitations in mind.

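with
  recursive r as (
      -- anchor: first sentence of every paragraph; the cast fixes the
      -- column type, so sentences are limited to 256 characters
      select
        1 id,
        cast(regexp_substr(
               Paragraph, '[^.!?]+(?:[.!?]+|$)'
             ) as char(256)) sentences,
        id doc_id, Title, Paragraph
      from master_data
    union all
      -- recursive step: fetch occurrence id + 1 of the pattern,
      -- stopping once the previous occurrence was NULL
      select id + 1,
        regexp_substr(
          Paragraph, '[^.!?]+(?:[.!?]+|$)',
          1, id + 1
        ),
        doc_id, Title, Paragraph
      from r
      where sentences is not null
  )
select id, sentences, doc_id, Title
from r
-- keep one row per paragraph even when no sentence was found
where sentences is not null or id = 1
order by doc_id, id;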

Output:

| id |       sentences       | doc_id | Title  |
+----+-----------------------+--------+--------+
|  1 | I want.               |      1 | asds.. |
|  2 | Some.                 |      1 | asds.. |
|  3 | Coconut and Banana !! |      1 | asds.. |
|  1 | Milkshake?            |      2 | wad... |
|  2 | some Nice milk.       |      2 | wad... |
|  1 | bar                   |      3 | foo    |

Demo on DB Fiddle.
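
The query above only selects the sentences; the question also asks for them to be stored in splittext. A minimal sketch of that step follows, reusing the same recursive CTE inside an INSERT ... SELECT. It assumes that splittext.id_sen is an AUTO_INCREMENT column (the question does not say) and that no sentence exceeds 256 characters. The counter column is renamed to n here purely for readability, and TRIM plus the empty-string filter are additions to drop the whitespace that the pattern leaves around sentences.

```sql
-- Sketch only: assumes id_sen is AUTO_INCREMENT, so it is not listed here.
INSERT INTO splittext (sentences, doc_id)
WITH RECURSIVE r AS (
    -- anchor: first sentence of each paragraph
    SELECT
      1 AS n,
      CAST(REGEXP_SUBSTR(Paragraph, '[^.!?]+(?:[.!?]+|$)') AS CHAR(256)) AS sentences,
      id AS doc_id,
      Paragraph
    FROM master_data
  UNION ALL
    -- recursive step: occurrence n + 1 of the same pattern
    SELECT
      n + 1,
      REGEXP_SUBSTR(Paragraph, '[^.!?]+(?:[.!?]+|$)', 1, n + 1),
      doc_id,
      Paragraph
    FROM r
    WHERE sentences IS NOT NULL
)
SELECT TRIM(sentences), doc_id
FROM r
WHERE sentences IS NOT NULL
  AND TRIM(sentences) <> ''
ORDER BY doc_id, n;
```

The sentences keep their closing punctuation, as in the output above. Paragraphs with a very large number of sentences may hit the server's cte_max_recursion_depth setting, which can be raised for the session if needed.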
