Question

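I need to filter some badly formatted text. In many cases a quotation starts on one line, gets cut off, and ends on the next line. In those cases I would prefer to remove the partial quote entirely, while keeping regular, complete quotes. I know this could be done by iterating with a counter, but I would really prefer to solve it with a regular expression.

Take the following example: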

"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote

Here is my current attempt: (\"[^\"\n]+?|^[^\"\n]+?\")(\n|$) Note that it fails in two cases:

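Line 3: the line that closes a partial quote continues with the rest of a sentence (very rare, so it is not the end of the world if we can't solve it).
Line 6: the embedded quote. This is the major problem and the main reason I am asking. The pattern grabs everything from the last quotation mark of the embedded quote to the end of the line.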

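I figure I could set up an if statement and run through each line, checking whether it has fewer than two quotation marks and only then parsing the partial quote, but I thought the minds of SO would have a cleaner solution.

Note that the desired output is: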

"This is a quote"
This is an end 
 Here is more text.
This is an end 
This is an "embedded" quote

(I'll deal with the whitespace later.)

Answer 1

Here you go,

^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)

Replace the matched characters with \1\n

DEMO

>>> import re
>>> s = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''
>>> m = re.sub(r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)', r'\1\n', s)
>>> print(m)
"This is a quote"
This is an end 
 Here is more text.
This is an end 
This is an "embedded" quote
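To make the pieces of the pattern easier to follow, here is the same regex spread out with re.VERBOSE. Treat it as an annotated sketch: the name PARTIAL_QUOTE and the comments are one reading of the pattern and are not part of the original answer.

```python
import re

# Hypothetical name; the pattern is the one above, spread out with re.VERBOSE.
PARTIAL_QUOTE = re.compile(r'''
    ^                          # start of a line (needs re.MULTILINE)
    (                          # group 1: the part of the line that is kept
      (?:[^"\n]*"[^"\n]*")*    #   any complete quote pairs earlier on the line
      [^"\n]*                  #   plain text before the dangling opening quote
    )
    "[^"\n]*\n                 # opening quote and its text up to the line break
    [^"\n]*"                   # continuation on the next line and the closing quote
    (\n|)                      # a newline directly after the closing quote, if any
''', re.MULTILINE | re.VERBOSE)

text = (
    '"This is a quote"\n'
    'This is an end "partial-\n'
    'quote" Here is more text.\n'
    'This is an end "partial-\n'
    'quote w/o more text"\n'
    'This is an "embedded" quote'
)

# Keep group 1 and put back a single line break in place of the partial quote.
cleaned = PARTIAL_QUOTE.sub(r'\1\n', text)
print(cleaned)
```

Because group 1 swallows any complete quote pairs first, a line such as the "embedded" one has no dangling opening quote left to match, so it is left untouched.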

If you want to handle quotes that span more than two lines, use this regex instead.

^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)

DEMO
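A small check of the multi-line variant, as a sketch only. The sample string and the name MULTILINE_PARTIAL are made up for illustration; the pattern and the \1\n replacement are the ones given in this answer.

```python
import re

# The multi-line pattern from above; (?m) makes ^ match at every line start.
MULTILINE_PARTIAL = re.compile(
    r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)'
)

# Made-up sample: one quote broken across three lines, then a complete quote.
sample = 'Start "one-\ntwo-\nthree" tail\n"whole" quote'

result = MULTILINE_PARTIAL.sub(r'\1\n', sample)
print(repr(result))
```

The only change from the first pattern is (?:[^"\n]*\n)+, which allows one or more broken lines before the line that carries the closing quote.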

Answer 2

You can use this regex:

"[^"\n]+?\n[^"\n]+?(?:"|$)\s*

and replace with \n.

regex101 demo

The "[^"\n]+?\n[^"\n]+? part matches only partial quotes (it ensures there is a newline between the quotation marks).

ideone demo
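In Python this could look roughly like the sketch below. The re.MULTILINE flag is an assumption (the flags are not stated above), as is the name PARTIAL; the sample text is the one from the question.

```python
import re

# Pattern from this answer. re.MULTILINE is an assumption so that $ can
# also match at the end of a line, not only at the end of the text.
PARTIAL = re.compile(r'"[^"\n]+?\n[^"\n]+?(?:"|$)\s*', re.MULTILINE)

text = (
    '"This is a quote"\n'
    'This is an end "partial-\n'
    'quote" Here is more text.\n'
    'This is an end "partial-\n'
    'quote w/o more text"\n'
    'This is an "embedded" quote'
)

# Each partial quote, plus any whitespace after it, becomes one line break.
cleaned = PARTIAL.sub('\n', text)
print(cleaned)
```

Note that the trailing \s* also consumes the space before "Here is more text.", so the result differs from the desired output only in whitespace, which the question says will be handled later.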

Answer 3

("[^"\n]*")|"[^"]*(\n)[^"]*"(?![^\n]*")|"[^"]*\n.*?(?=\n[^"]*"[^\n"]*")

You can try this. It also handles an odd number of quotes. See the demo.

https://regex101.com/r/dL7oF8/6
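The replacement string is not stated above, so the following is only a sketch under an assumption: keep group 1 (a complete quote) and group 2 (the line break inside a partial quote) and drop the rest of each match. The names PATTERN and keep_groups are hypothetical.

```python
import re

PATTERN = re.compile(
    r'("[^"\n]*")|"[^"]*(\n)[^"]*"(?![^\n]*")|"[^"]*\n.*?(?=\n[^"]*"[^\n"]*")'
)

def keep_groups(match):
    # Assumed replacement: keep group 1 (a complete quote) and group 2
    # (the line break inside a partial quote); drop everything else.
    return (match.group(1) or '') + (match.group(2) or '')

text = (
    '"This is a quote"\n'
    'This is an end "partial-\n'
    'quote" Here is more text.\n'
    'This is an end "partial-\n'
    'quote w/o more text"\n'
    'This is an "embedded" quote'
)

cleaned = PATTERN.sub(keep_groups, text)
print(cleaned)
```

With this assumed replacement, the newline that follows a closing quote is not consumed, so on this sample an empty line is left after the second "This is an end " line.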
