Fix sentences that have line breaks in the middle of them: Python is\n fun

Date: 2015-10-01 21:20:20

Tags: python regex nltk

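I am currently using Apache Tika to extract text from PDFs, and NLTK for named entity recognition and other tasks. The problem I am running into is that sentences in the PDF documents are extracted with line breaks in the middle of them. For example: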

  

I am a sentence that has a python line \nbreak in the middle of it.

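The pattern is usually a space followed by a newline, <space>\n, or sometimes <space>\n<space>. I want to repair these sentences so that I can run a sentence tokenizer on them.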

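I am trying to use the regex pattern (.+?)(?:\r\n|\n)(.+[.!?]+[\s|$]) to replace \n with an empty string.

Problems:

  1. Sentences that start on the same line, right after another sentence ends, are not matched.
  2. How do I match sentences whose line breaks span multiple lines? In other words, how do I allow multiple occurrences of (?:\r\n|\n)?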

    text = """
    Random Data, Company
    2015
    
    This is a sentence that has line 
    break in the middle of it due to extracting from a PDF.
    
    How do I support
    3 line sentence 
    breaks please?
    
    HEADER HERE
    
    The first sentence will 
    match. However, this line will not match
    for some reason 
    that I cannot figure out.
    
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    
    Full Name 
    San Francisco, CA  
    94000
    
    1500 testing a number as the first word in
    a broken sentence.
    
    Match sentences with capital letters on the next line like 
    Wi-Fi.
    
    This line has 
    trailing spaces after exclamation mark!       
    """
    import re
    new_text = re.sub(pattern=r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])', repl=r'\g<1>\g<2>', string=text, flags=re.MULTILINE)
    print(new_text)
    
    expected_result = """
    Random Data, Company
    2015
    
    This is a sentence that has line break in the middle of it due to extracting from a PDF.
    
    How do I support 3 line sentence breaks please?
    
    HEADER HERE
    
    The first sentence will match. However, this line will not match for some reason that I cannot figure out.
    
    Portfolio: 
    http://DoNotMatchMeBecauseIHaveAPeriodInMe.com 
    
    Full Name 
    San Francisco, CA  
    94000
    
    1500 testing a number as the first word in a broken sentence.
    
    Match sentences with capital letters on the next line like Wi-Fi.
    
    This line has trailing spaces after exclamation mark!       
    """
    

1 Answer:

Answer 0: (score: 3)

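Your regex does not match lines that end in trailing spaces, and it does not handle a sentence that is split across 3 lines. As a result, such a sentence is never merged into a single line.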
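
The three-line case can be reproduced in isolation. The sketch below runs the pattern from the question on one of the sample sentences; the names `QUESTION_PATTERN` and `three_lines` are only illustrative, and the `re.MULTILINE` flag is left out on the assumption that it makes no difference here, since the pattern has no `^` or `$` anchors (the `$` inside the character class is a literal).

```python
import re

# Pattern from the question: some text, a newline, then a line that
# must contain sentence-ending punctuation followed by whitespace.
QUESTION_PATTERN = re.compile(r'(.+?)(?:\r\n|\n)(.+[.!?]+[\s|$])')

three_lines = "How do I support\n3 line sentence \nbreaks please?\n"

result = QUESTION_PATTERN.sub(r'\g<1>\g<2>', three_lines)
print(repr(result))
# 'How do I support\n3 line sentence breaks please?\n'
```

The first line is left alone because the line after it contains no `.`, `!` or `?`, so the second group cannot match there. Only the last two lines get merged.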

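Here is an alternative regex that joins all lines between two blank lines into one, making sure there is exactly one space between the joined lines: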

    # The new regex
    (\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)
    # The replacement string: \1 \2

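Explanation: this searches for any non-whitespace character \S, followed by optional spaces or tabs, a newline, more optional spaces or tabs, and then another \S. It replaces the newline and the surrounding spaces between the two \S characters with a single space. Spaces and tabs are listed explicitly as [ \t] instead of \s because \s also matches newlines, which would let the match run across blank lines.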
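
Applied from Python, the replacement above might look like the following minimal sketch. The helper name `join_broken_lines` and the `sample` text are made up for illustration; only the pattern and the replacement string come from the answer.

```python
import re

# Regex from the answer: a non-whitespace character, optional spaces/tabs,
# one newline, optional spaces/tabs, and another non-whitespace character.
JOIN_PATTERN = re.compile(r'(\S)[ \t]*(?:\r\n|\n)[ \t]*(\S)')


def join_broken_lines(text):
    """Join consecutive non-blank lines with a single space (hypothetical helper)."""
    return JOIN_PATTERN.sub(r'\1 \2', text)


sample = (
    "This is a sentence that has line \n"
    "break in the middle of it.\n"
    "\n"
    "How do I support\n"
    "3 line sentence \n"
    "breaks please?\n"
)

print(join_broken_lines(sample))
# This is a sentence that has line break in the middle of it.
#
# How do I support 3 line sentence breaks please?
```

Blank lines are kept because the pattern requires a non-whitespace character right after the single newline. Note that this approach joins every run of consecutive non-blank lines, so short blocks such as headers or addresses are merged as well.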