将符号后面的双引号与Python中的正则表达式匹配以分割句子

时间:2018-04-16 07:04:57

标签: python regex

我确定我只是遗漏了一些东西,但我的正则表达式有点生疏。

我有一个格式正确的文本语料库,它来自一个SQLite数据库,每个评论都是一行,这很好,我把它写成文本文件,所以每个评论是一行后跟一个换行符。

我需要做的是将每个句子转换为一行以提供一个迭代器,该迭代器将句子作为线条然后输入模型。文本都是专业编写和编辑的,因此一个简单的正则表达式基于以[。!?]或[。!?]结尾的字符串分隔行,后跟双引号(")实际上就足够了。

之类的东西
re.split('(?<=[.!?]) +|((?<=[.!?])\")', text)

lookbehind适用于除(&#34;)之外的任何事物。我通常在R或Ruby中完成正则表达式,这只会让我在周日晚上的凌晨时感到愚蠢。

示例文字:

“跳跃之旅”最终成为了一个'90年代的妙语,一个音乐新闻简写为“过度夸张的酒店休息室音乐。”但今天,备受诟病的子类似乎是一个秘密的先例。聆听上世纪90年代中期的任何规范的布里斯托尔场景专辑,当时这种类型开始对其界限产生影响,你会认为这个幽闭,焦虑的21世纪比计划提前几年开始。

提前感谢任何建议。

1 个答案:

答案 0 :(得分:0)

您可以使用

r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'

请参阅regex demo

<强>详情

  • (?: - 开始非捕获交替组匹配:
    • (?<=[.!?]) - 紧接着.!?
    • 的职位
    • | - 或
    • (?<=[.!?]["”]) - 紧接着.!?后跟"
    • 的职位
  • ) - 分组结束
  • \s+ - 1+空格。

Python 2 demo

import re
rx = ur'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
s = u"“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule."
for result in re.split(rx, s):
    print(result.encode("utf-8"))

输出:

“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.”

But today, the much-maligned subgenre almost feels like a secret precedent.

Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.