Consider the following string:
08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"
I want to extract the two sentences from it, namely:
"08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH"
"NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"
I tried [\w]+(?!\\t), but this also matches the t in \t(1, among other things.
What is the right syntax here? Thanks!
Answer 0 (score: 2)
Here, split on this:
r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'
http://www.regex101.com/r/lNv8VO/1
Explanation
(?: \\ [\\ntr] )+ # The start of a block of escaped \ or n or t or r
# Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
(?: # Cluster optional
(?: # ----------
(?! \\ [\\ntr] ) # Not an escaped \ or n or t or r ahead
. # This is ok, consume this
)* # ---------- 0 to many times
\\ [\\ntr] # A required escaped \ or n or t or r at the end
)* # Cluster end, do 0 to many times
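As a rough check, here is a minimal Python sketch of splitting with this pattern. The sample text is a shortened, made-up stand-in for the string in the question, and it assumes the \n and \t sequences are literal backslash-letter pairs rather than real control characters.

```python
import re

# Pattern copied from the answer: a run of escaped \, n, t or r, followed by
# any number of "non-escape text, then another escape" groups.
SEP = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')

# Invented sample; the raw string keeps \n and \t as two-character sequences.
text = (
    r'first sentence ENGLISH\n\t\t\tOTHER LANGUAGES\n\t\t(1)'
    r'\n\t\tSelect your language\n\t\tNederlandsNL second sentence'
)

# The match runs from the first escape to the last one, so everything in
# between (including the short labels) is treated as one separator.
parts = SEP.split(text)
print(parts)
```

Because the pattern has no capturing groups, re.split returns only the text on either side of the match.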
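Note
The above regex will split the text into 2 parts in most cases.
If the material being split off contains non-escaped r, n, t, you can allow multiple splits as long as the text between escapes is below a certain threshold.
@MadPhysicist suggested a length of 20. I gave it 40 and used it in the middle of the regex as a range, in this section: (?:(?!\\[\\ntr]).){0,40}?
The new regex is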
r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'
https://regex101.com/r/lNv8VO/3
Explanation
(?s) # Modifiers: dot-all
(?: \\ [\\ntr] )+ # The start of a block of escaped \ or n or t or r
# Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
(?: # Cluster optional
\s* # Optional whitespace
(?: # ----------
(?! \\ [\\ntr] ) # Not an escaped \ or n or t or r ahead
. # This is ok, consume this
){0,40}? # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
\s* # Optional whitespace
\\ [\\ntr] # A required escaped \ or n or t or r at the end
)* # Cluster end, do 0 to many times
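To see what the threshold changes, the following sketch runs the earlier pattern and the bounded one on the same text. The text is invented for illustration and again assumes literal backslash sequences; it contains three long passages separated by escape runs and short labels.

```python
import re

# Both patterns are copied from the answer; the sample text is invented.
FIRST = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')
BOUNDED = re.compile(
    r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'
)

text = (
    r'Sentence one ENGLISH\n\t\tOTHER LANGUAGES\n\t\t'
    r'NederlandsNL sentence two, which is far longer than forty characters ENGLISH'
    r'\n\t\t(1)\n\t'
    r'Sentence three, again comfortably longer than forty characters'
)

# The unbounded pattern swallows everything between the first and last
# escape, so the middle passage is lost.
two_parts = FIRST.split(text)

# The bounded pattern only absorbs runs of at most 40 characters between
# escapes, so the long middle passage survives as its own piece.
three_parts = BOUNDED.split(text)

print(two_parts)
print(three_parts)
```

The value 40 is only a heuristic: any label longer than that between two escapes would be kept as a piece, and any real sentence shorter than that would be absorbed into the separator.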
Answer 1 (score: 1)
Assuming the \n and \t characters are actually newlines and tabs, try:
([^\n\t]*)
Then augment it to get rid of the other languages, etc.
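One possible way to do that in Python is sketched below. The sample text is shortened and invented, and the date filter is purely an assumption about what distinguishes the wanted sentences from the language-menu labels; it is not part of the answer.

```python
import re

# Invented sample with real newline and tab characters.
text = (
    '08/07/2017Peter Praet: Interview ENGLISH'
    '\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\tSelect your language\n\t\t'
    'NederlandsNL07/07/2017Benoît Cœuré: Interview ENGLISH'
)

# The * quantifier also produces empty matches, so drop those.
chunks = [m for m in re.findall(r'([^\n\t]*)', text) if m]

# Hypothetical extra filter: keep only chunks containing a dd/mm/yyyy date.
DATE = re.compile(r'\d{2}/\d{2}/\d{4}')
dated = [c for c in chunks if DATE.search(c)]

print(chunks)
print(dated)
```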
Answer 2 (score: 1)
In Python, you can split the string on the newlines and tabs, then filter out the pieces that are too short.
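import re
# long_string holds the text, with real newline and tab characters.
# Split on a newline followed by tabs; keep pieces longer than 20 characters.
sentences = [x for x in re.split(r'\n\t+', long_string) if len(x) > 20]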
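For a quick, self-contained check of this approach, the sketch below wraps the same split in a function and runs it on a shortened, made-up version of the question's string. The names split_sentences and min_len are hypothetical and used only for illustration.

```python
import re

def split_sentences(text, min_len=20):
    """Split on a newline followed by tabs; keep pieces longer than min_len."""
    return [piece for piece in re.split(r'\n\t+', text) if len(piece) > min_len]

# Invented sample with real newline and tab characters.
sample = (
    '08/07/2017Peter Praet: Interview with De Standaard ENGLISH'
    '\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\t'
    'Select your language\n\t\t\t\n\t\t'
    'NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde ENGLISH'
)

result = split_sentences(sample)
print(result)
```

Note that "Select your language" is exactly 20 characters long, so the strict > 20 comparison is what excludes it; a slightly longer menu label would slip through, so the threshold may need tuning for other pages.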