Regex: splitting a long string on tabs and newlines

Asked: 2017-07-12 02:00:49

Tags: python r regex

Consider the following string:

08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet,  Member of the Executive Board of the ECB,  conducted by Pascal Dendooven and Goele De Cort on 3 July 2017,  published on 8 July 2017ENGLISH\n\t\t\t\t\t\t\tOTHER LANGUAGES\n\t\t\t\t\t\t\t(1)\n\t\t\t\t\t\t\t+\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\tSelect your language\n\t\t\t\t\t\t\t\n\t\t\t\t\t\tNederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré,  Member of the Executive Board of the ECB,  conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa),  on 3 July,  published on 7 July 2017ENGLISH"

I would like to extract the two sentences in it, namely:

  • "08/07/2017Peter Praet: Interview with De StandaardInterview with Peter Praet, Member of the Executive Board of the ECB, conducted by Pascal Dendooven and Goele De Cort on 3 July 2017, published on 8 July 2017ENGLISH"

  • "NederlandsNL07/07/2017Benoît Cœuré: Interview with Le Monde and La StampaInterview with Benoît Cœuré, Member of the Executive Board of the ECB, conducted by Marie Charrel (Le Monde) and Alessandro Barbera (La Stampa), on 3 July, published on 7 July 2017ENGLISH"

I tried using [\w]+(?!\\t), but this grabs the t in \t(1, among other things.

What is the right syntax here? Thanks!

3 answers:

Answer 0 (score: 2)

Here, split on this:

r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*'

http://www.regex101.com/r/lNv8VO/1

Explanation

 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      )*                            # ---------- 0 to many times
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
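
As a rough illustration of how this pattern behaves with Python's re.split, here is a minimal sketch. The shortened sample and the variable names (first, second, junk) are assumptions made for the demo, and the raw strings deliberately keep \n and \t as literal backslash sequences, since that is what the pattern matches.

```python
import re

# Hypothetical shortened sample. The raw strings keep \n and \t as literal
# two-character backslash sequences, which is what the pattern targets.
first = r"08/07/2017Peter Praet: Interview with De StandaardENGLISH"
second = r"NederlandsNL07/07/2017Interview with Le Monde and La StampaENGLISH"
junk = r"\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t"
sample = first + junk + second

pattern = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')

# One match runs from the first escape sequence to the last one,
# so splitting on it leaves only the text before and after.
parts = pattern.split(sample)
print(parts)
```

If the string actually contains real newline and tab characters rather than backslash sequences, this pattern finds nothing to split on and a character class such as [\n\t] is needed instead.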

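Note
For the sample text, the regex above yields 2 parts, because a single match runs from the first escape sequence to the last.

If the separator run contains text that is not an escaped \, n, t, or r, you can allow multiple splits by absorbing that text only when it is shorter than some threshold.

@MadPhysicist suggested a length of 20. I went with 40 and used it in the middle of the regex, as a range in this section: (?:(?!\\[\\ntr]).){0,40}?

The new regex is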

r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*'

https://regex101.com/r/lNv8VO/3

Explanation

 (?s)                          # Modifiers:  dot-all
 (?: \\ [\\ntr] )+             # The start of a block of escaped \ or n or t or r
                               # Get as many as are there (like '\n\n\r\r\t\t\n\\', etc)
 (?:                           # Cluster optional
      \s*                           # Optional whitespace
      (?:                           # ----------
           (?! \\ [\\ntr] )              # Not an escaped \ or n or t or r ahead
           .                             # This is ok, consume this
      ){0,40}?                      # ---------- Allow (non-greedy) 0 to 40 characters for multiple sections
      \s*                           # Optional whitespace
      \\ [\\ntr]                    # A required escaped \ or n or t or r at the end
 )*                            # Cluster end, do 0 to many times
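
To see what the 40-character bound changes, the two patterns can be compared on a sample with three long pieces. This is only a sketch: the sample text, including the made-up third entry, and the names greedy and bounded are assumptions for the demo.

```python
import re

# Hypothetical sample: three long pieces separated by short page residue.
# Raw strings keep \n and \t as literal backslash sequences.
a = r"08/07/2017Peter Praet: Interview with De Standaard, published on 8 July 2017ENGLISH"
b = r"07/07/2017 Interview with Le Monde and La Stampa, published on 7 July 2017ENGLISH"
c = r"06/07/2017 A third, made-up entry that is also longer than forty charactersENGLISH"
sample = a + r"\n\t\tOTHER LANGUAGES\n\t" + b + r"\n\t\t(1)\n\t" + c

greedy = re.compile(r'(?:\\[\\ntr])+(?:(?:(?!\\[\\ntr]).)*\\[\\ntr])*')
bounded = re.compile(r'(?s)(?:\\[\\ntr])+(?:\s*(?:(?!\\[\\ntr]).){0,40}?\s*\\[\\ntr])*')

# The unbounded pattern absorbs the middle piece into one long separator.
print(greedy.split(sample))

# With the bound, text longer than 40 characters ends the separator,
# so the middle piece survives as its own element.
print(bounded.split(sample))
```

The bound is a heuristic: any real sentence shorter than the limit would be swallowed along with the residue, so the value has to be tuned to the data.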

Answer 1 (score: 1)

Assuming the \n and \t characters are actually newlines and tabs, try:

([^\n\t]*)

Then build on it to get rid of OTHER LANGUAGES and the like.
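
A minimal Python sketch of that idea follows, assuming the string holds real newline and tab characters. The shortened sample and the NOISE set are assumptions for illustration, not part of the original answer.

```python
import re

# Hypothetical shortened sample with real newline and tab characters.
first = "08/07/2017Peter Praet: Interview with De StandaardENGLISH"
second = "NederlandsNL07/07/2017Interview with Le Monde and La StampaENGLISH"
sample = (
    first
    + "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t"
    + second
)

# The pattern from the answer; '*' also yields empty matches, so drop those.
chunks = [m for m in re.findall(r'([^\n\t]*)', sample) if m]

# Assumed list of page residue to discard; adjust it to the real page.
NOISE = {"OTHER LANGUAGES", "(1)", "+", "Select your language"}
sentences = [chunk for chunk in chunks if chunk not in NOISE]
print(sentences)
```

Filtering against a fixed set is only one way to do the clean-up step; it breaks as soon as the page uses different residue strings.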

Answer 2 (score: 1)

In Python, you can split the string on tabs and newlines, then filter out the fragments that are too short.

import re

# long_string holds the scraped text, with real newline and tab characters
[x for x in re.split(r'\n\t+', long_string) if len(x) > 20]
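
For a quick check, the same split-and-filter can be run on a shortened sample. This is a sketch: the sample text and the names first, second, pieces and sentences are assumptions, and the string is assumed to contain real newline and tab characters.

```python
import re

# Hypothetical shortened sample with real newline and tab characters.
first = "08/07/2017Peter Praet: Interview with De StandaardENGLISH"
second = "NederlandsNL07/07/2017Interview with Le Monde and La StampaENGLISH"
sample = (
    first
    + "\n\t\t\tOTHER LANGUAGES\n\t\t\t(1)\n\t\t\t+\n\t\t\n\t\t\tSelect your language\n\t\t"
    + second
)

# Split wherever a newline is followed by one or more tabs.
pieces = re.split(r'\n\t+', sample)

# Keep only fragments longer than 20 characters.
sentences = [x for x in pieces if len(x) > 20]
print(sentences)
```

"Select your language" is exactly 20 characters long, so the strict > 20 comparison is what removes it here. A newline that is not followed by a tab is not a split point with this pattern.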