Question

Consider the following string:

08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet,  Member of the Executive Board of the ECB,  conducted by Pascal Dendooven and Goele De Cort on 3 July 2017,  published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré,  Member of the Executive Board of the ECB,  conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa),  on 3 July,  published on 7 July 2017ENGLISH"

I want to extract the two sentences in it, namely:

"08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH"
"NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"

I tried [\w]+(?!\\t), but that catches the t in t(1 as well as other stuff.

What is the right syntax here? Thanks!

Answer 1

Here, split on this:

r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'

http://www.regex101.com/r/lNv8VO/1

Explanation

 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      )*                            # ---------- 0 to many times
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
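
As a rough illustration of how this pattern behaves with re.split, here is a minimal sketch. It assumes the escapes in the text are literal two-character sequences (a backslash followed by n or t) rather than real newlines and tabs, and it uses a shortened, hypothetical stand-in for the string in the question; the names text, splitter and parts are made up for the example.

```python
import re

# Assumption: the escapes are literal two-character sequences (a backslash
# followed by n or t), so the sample is written as raw strings.
# The text is a shortened, hypothetical stand-in for the string in the question.
text = (
    r"published on 8 July 2017ENGLISH\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)"
    r"\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t\t\n\t\t"
    r"NederlandsNL07/07/2017Interview with Le Monde and La Stampa"
)

splitter = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')

# The pattern has no capturing groups, so split() returns only the text
# outside the matched block of escapes.
parts = splitter.split(text)
for part in parts:
    print(part)
```

Because the inner cluster accepts any amount of text as long as another escape follows, a single match runs from the first escape to the last one, and everything in between (OTHER LANGUAGES, Select your language, and so on) is swallowed by the separator.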

Note
The regex above splits the text into at most 2 parts.

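If the content being split contains non-escaped text between the r, n, t escapes, you can allow multiple splits by treating that text as part of the separator only when its length is below a certain threshold.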

@MadPhysicist suggested a length of 20. I gave it 40, and applied it in the middle of the regex by putting a range on this section: (?:(?:(?!\\[\\ntr]).){0,20}

The new regex is

r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'

https://regex101.com/r/lNv8VO/3

Explanation

 (?s)                          # Modifiers: dot-all
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      \s*                           # Optional whitespace
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      ){0,40}?                      # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
      \s*                           # Optional whitespace
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
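
To see what the 40-character bound changes, the following sketch runs both patterns over the same input. It again assumes literal backslash sequences; the three-sentence sample (including the "DeutschDE" entry) is invented for illustration, and the variable names are hypothetical.

```python
import re

# Assumption: literal backslash sequences, hence raw strings. The sample is
# hypothetical: three long sentences separated by blocks of escapes and
# short menu text.
text = (
    r"published on 8 July 2017ENGLISH\n\t\tOTHER LANGUAGES\n\t\tSelect your language\n\t\t"
    r"NederlandsNL07/07/2017Interview with Le Monde and La Stampa, published on 7 July 2017ENGLISH"
    r"\n\t\tOTHER LANGUAGES\n\t"
    r"DeutschDE06/07/2017Interview with a hypothetical third newspaper, published on 6 July 2017ENGLISH"
)

unbounded = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')
bounded = re.compile(r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*')

# Unbounded: one match spans from the first escape to the last one,
# so the middle sentence is swallowed by the separator.
unbounded_parts = unbounded.split(text)

# Bounded: text longer than 40 characters between escapes ends the match,
# so each long sentence survives as its own piece.
bounded_parts = bounded.split(text)

print(len(unbounded_parts), len(bounded_parts))
```

Short runs such as OTHER LANGUAGES still count as part of the separator, while anything longer than the bound (plus surrounding whitespace) is kept. The value 40 is a tuning choice that depends on how long the unwanted fragments can get.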

Answer 2

Assuming the \n and \t characters are actually newlines and tabs, try:

([^\n\t]*)

Then augment it to get rid of OTHER LANGUAGES and so on.
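
One possible way to read this suggestion, sketched under the assumption that the string holds real newline and tab characters: collect the runs that contain neither, then tighten the pattern with a minimum length so short menu fragments drop out. The sample text and the threshold of 21 characters are assumptions made for the example, not part of the answer.

```python
import re

# Assumption: real newline and tab characters (a normal, non-raw string).
# Shortened, hypothetical stand-in for the string in the question.
text = (
    "published on 8 July 2017ENGLISH\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)"
    "\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t\t\n\t\t"
    "NederlandsNL07/07/2017Interview with Le Monde and La Stampa"
)

# The suggested pattern also produces empty matches, so drop those.
runs = [m for m in re.findall(r'([^\n\t]*)', text) if m]

# Hypothetical refinement: require at least 21 characters per run so that
# fragments such as "OTHER LANGUAGES" and "(1)" are skipped.
sentences = re.findall(r'[^\n\t]{21,}', text)

print(runs)
print(sentences)
```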

Answer 3

In Python, you can split the string on tabs and newlines, then filter out the pieces that are too short.

import re

[x for x in re.split('\n\t+', long_string) if len(x) > 20]
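
A runnable version of the snippet above, as a sketch: long_string here is a shortened, hypothetical stand-in for the string in the question, and it assumes real newline and tab characters.

```python
import re

# Assumption: real newlines and tabs; shortened, hypothetical sample.
long_string = (
    "published on 8 July 2017ENGLISH\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)"
    "\n\t\t\t+\n\t\t\n\t\t\n\t\t\t\n\t\t\tSelect your language\n\t\t\t\n\t\t"
    "NederlandsNL07/07/2017Interview with Le Monde and La Stampa"
)

# Split on a newline followed by one or more tabs; this leaves empty strings
# and short menu fragments among the pieces.
pieces = re.split('\n\t+', long_string)

# Keep only pieces longer than 20 characters.
kept = [x for x in pieces if len(x) > 20]
print(kept)
```

Note that "Select your language" is exactly 20 characters long, so the strict comparison len(x) > 20 is what excludes it. With different data the threshold would need adjusting.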

Regex: splitting a long string on tabs and newlines
