Splitting paragraph documents into sentences

Time: 2019-08-28 07:10:10

Tags: mysql regex regexp-substr

I have a database of paragraph documents. I want to split each paragraph in the table "master_data" into sentences and store them in a different table, "splittext".

master_data table:

id | Title | Paragraph

splittext table:

id_sen | sentences | doc_id 

I tried to select each sentence from master_data.Paragraph with this query:

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '[^\.\!\* 
[\.\!\?]'; 

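But it produces a bracket error. So I tried wrapping the pattern in parentheses, which produces an "incorrect parameter count" error: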

SELECT Paragraph FROM pyproject.master_data  where REGEXP_SUBSTR '([^\.\!\* 
[\.\!\?])'; 

My expected result is that each paragraph is split into sentences and stored in the new table, with the paragraph's original id stored in doc_id.

For example:

master_data:

id | Title | Paragraph  |
 1 | asds..| I want. Some. Coconut and Banana !! |
 2 | wad...| Milkshake? some Nice milk.          |

splittext_table:

id | sentences      | doc_id |
 1 | I want         |   1    |
 2 | Some           |   1    |
           .
           .
           .
 5 | Some Nice milk |   2    |

1 Answer:

Answer 0 (score: 1)

For MySQL 8.0, you can use a recursive CTE, as long as you keep its limitations in mind:

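with
  recursive r as (
      -- anchor: first sentence of every paragraph; the cast fixes the
      -- column type, so sentences are limited to 256 characters
      select
        1 id,
        cast(regexp_substr(
               Paragraph, '[^.!?]+(?:[.!?]+|$)'
             ) as char(256)) sentences,
        id doc_id, Title, Paragraph
      from master_data
    union all
      -- recursive step: fetch occurrence number id + 1 of the pattern
      select id + 1,
        regexp_substr(
          Paragraph, '[^.!?]+(?:[.!?]+|$)',
          1, id + 1
        ),
        doc_id, Title, Paragraph
      from r
      -- stop once a paragraph has no further match
      where sentences is not null
  )
select id, sentences, doc_id, Title
from r
-- "or id = 1" keeps one row for paragraphs with no match at all
where sentences is not null or id = 1
order by doc_id, id;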

Output:

| id |       sentences       | doc_id | Title  |
+----+-----------------------+--------+--------+
|  1 | I want.               |      1 | asds.. |
|  2 | Some.                 |      1 | asds.. |
|  3 | Coconut and Banana !! |      1 | asds.. |
|  1 | Milkshake?            |      2 | wad... |
|  2 | some Nice milk.       |      2 | wad... |
|  1 | bar                   |      3 | foo    |

Demo on DB Fiddle.
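
The query above only selects the sentences, while the question also asks for them to be stored in splittext. A minimal sketch of that step follows. It assumes that splittext.id_sen is an AUTO_INCREMENT column (the question does not say), so only sentences and doc_id are inserted. The counter is renamed to n here purely for readability, and the TRIM calls are an addition of this sketch to drop the spaces left between sentences.

```sql
-- Sketch only: assumes splittext.id_sen is AUTO_INCREMENT.
INSERT INTO splittext (sentences, doc_id)
WITH RECURSIVE r AS (
    -- anchor: first match in every paragraph
    SELECT
      1 AS n,
      CAST(REGEXP_SUBSTR(Paragraph, '[^.!?]+(?:[.!?]+|$)') AS CHAR(256)) AS sentences,
      id AS doc_id,
      Paragraph
    FROM master_data
  UNION ALL
    -- recursive step: match number n + 1 of the same paragraph
    SELECT
      n + 1,
      REGEXP_SUBSTR(Paragraph, '[^.!?]+(?:[.!?]+|$)', 1, n + 1),
      doc_id,
      Paragraph
    FROM r
    WHERE sentences IS NOT NULL
)
SELECT TRIM(sentences), doc_id
FROM r
-- NULL and whitespace-only fragments both fail this test and are skipped
WHERE TRIM(sentences) <> ''
ORDER BY doc_id, n;
```

Two limits are worth checking before running this on real data. A sentence longer than 256 characters does not fit the CHAR(256) column type fixed by the anchor. A paragraph with more than about 1000 sentences exceeds the default cte_max_recursion_depth of 1000, which can be raised for the session with SET SESSION cte_max_recursion_depth. Unlike the expected result in the question, the stored sentences keep their closing punctuation.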