我确定我只是遗漏了一些东西,但我的正则表达式有点生疏。
我有一个格式正确的文本语料库,它来自一个SQLite数据库,每个评论都是一行,这很好,我把它写成文本文件,所以每个评论是一行后跟一个换行符。
我需要做的是将每个句子转换为一行以提供一个迭代器,该迭代器将句子作为线条然后输入模型。文本都是专业编写和编辑的,因此一个简单的正则表达式基于以[。!?]或[。!?]结尾的字符串分隔行,后跟双引号(")实际上就足够了。
之类的东西re.split('(?<=[.!?]) +|((?<=[.!?])\")', text)
lookbehind适用于除(&#34;)之外的任何事物。我通常在R或Ruby中完成正则表达式,这只会让我在周日晚上的凌晨时感到愚蠢。
示例文字:
“跳跃之旅”最终成为了一个'90年代的妙语,一个音乐新闻简写为“过度夸张的酒店休息室音乐。”但今天,备受诟病的子类似乎是一个秘密的先例。聆听上世纪90年代中期的任何规范的布里斯托尔场景专辑,当时这种类型开始对其界限产生影响,你会认为这个幽闭,焦虑的21世纪比计划提前几年开始。提前感谢任何建议。
答案 0 :(得分:0)
您可以使用
r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
请参阅regex demo
<强>详情
(?:
- 开始非捕获交替组匹配:
(?<=[.!?])
- 紧接着.
,!
或?
|
- 或(?<=[.!?]["”])
- 紧接着.
,!
或?
后跟"
或”
)
- 分组结束\s+
- 1+空格。import re
rx = ur'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
s = u"“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule."
for result in re.split(rx, s):
print(result.encode("utf-8"))
输出:
“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.”
But today, the much-maligned subgenre almost feels like a secret precedent.
Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.