Question

我确定我只是遗漏了一些东西，但我的正则表达式有点生疏。

我有一个格式正确的文本语料库，它来自一个SQLite数据库，每个评论都是一行，这很好，我把它写成文本文件，所以每个评论是一行后跟一个换行符。

我需要做的是将每个句子转换为一行以提供一个迭代器，该迭代器将句子作为线条然后输入模型。文本都是专业编写和编辑的，因此一个简单的正则表达式基于以[。！？]或[。！？]结尾的字符串分隔行，后跟双引号（＆＃34;）实际上就足够了。

之类的东西

re.split('(?<=[.!?]) +|((?<=[.!?])\")', text)

lookbehind适用于除（＆＃34;）之外的任何事物。我通常在R或Ruby中完成正则表达式，这只会让我在周日晚上的凌晨时感到愚蠢。

示例文字：

“跳跃之旅”最终成为了一个'90年代的妙语，一个音乐新闻简写为“过度夸张的酒店休息室音乐。”但今天，备受诟病的子类似乎是一个秘密的先例。聆听上世纪90年代中期的任何规范的布里斯托尔场景专辑，当时这种类型开始对其界限产生影响，你会认为这个幽闭，焦虑的21世纪比计划提前几年开始。

提前感谢任何建议。

Answer 1

您可以使用

r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'

请参阅regex demo

<强>详情

(?: - 开始非捕获交替组匹配：
- (?<=[.!?]) - 紧接着.，!或?
- | - 或
- (?<=[.!?]["”]) - 紧接着.，!或?后跟"或”
) - 分组结束
\s+ - 1+空格。

Python 2 demo：

import re
rx = ur'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
s = u"“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule."
for result in re.split(rx, s):
    print(result.encode("utf-8"))

输出：

“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.”

But today, the much-maligned subgenre almost feels like a secret precedent.

Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.

将符号后面的双引号与Python中的正则表达式匹配以分割句子

1 个答案: