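Extract text on the next line, ignoring whitespace

Time: 2015-12-12 15:22:53

Tags: ruby regex parsing pdf

Here, I asked how to match the line that follows a string.

Sometimes my PDF contains blank areas that skew my results. For example, sometimes I have: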

Title:  
this is the text I'd like to extract  
Not this one
Neither this  
(here my code works well)  

And sometimes it is formatted like this:

Title:

this is the text I'd like to extract  
Not this one  
Neither this  

This is my regex in Ruby:

^(?<=Title:\n)([^\n]+$)

How can I make the regex extract the next line only when that line contains characters (letters or digits) rather than just whitespace?

2 Answers:

Answer 0 (score: 0)

\S

means "not whitespace", and

\s

means "whitespace".

^(?<=Title:\n)([^\n\S]+$)

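This may not be exactly right, but you should be able to get the gist of how to use it. Basically, you need an if/else statement to work out how many extra newlines to skip before reaching the next character, depending on whether the line matches as whitespace. What I added to the code should read like this:

Start at a newline (\n) that does not have whitespace (\S) before the matched string ($).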
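One possible reading of this idea, as a rough sketch rather than the answerer's exact intent: use the class [^\S\n] (whitespace other than a newline) to absorb trailing spaces after "Title:", skip any blank lines, and only start capturing at the first non-whitespace character. The helper name line_after_title and the two sample strings are made up for illustration.

```ruby
# Hypothetical helper: returns the first non-blank line after a "Title:" line,
# with trailing whitespace removed, or nil when there is no such line.
def line_after_title(text)
  # [^\S\n]* : spaces/tabs after "Title:" (any whitespace except a newline)
  # \n\s*    : the line break plus any blank or whitespace-only lines
  # \K       : drop everything matched so far from the result
  # \S.*     : from the first non-whitespace character to the end of that line
  match = text[/^Title:[^\S\n]*\n\s*\K\S.*/]
  match && match.rstrip
end

# Sample inputs mirroring the two layouts from the question.
compact = "Title:  \nthis is the text I'd like to extract  \nNot this one\n"
spaced  = "Title:\n\nthis is the text I'd like to extract  \nNot this one  \n"

puts line_after_title(compact)
puts line_after_title(spaced)
```

Note that \s* also swallows any indentation at the start of the extracted line, which may or may not be what you want.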

Answer 1 (score: 0)

If you have read the entire file into a string:

text =
"Title:

this is the text I'd like to extract  
Not this one  
Neither this"
you can write:

r = /
    \b          # Match a word break
    Title:\s*\n # Match string
    \n*         # Match >= 0 newlines
    \K          # Forget everything matched so far
    [^\n]+      # Match as many characters as possible other than new lines
    /x          # Extended/free-spacing regex definition mode

text[r]
  #=> "this is the text I'd like to extract  " 

Another way (of many) is:

lines = text.split(/\n+/)
  #=> ["Title:", "this is the text I'd like to extract  ",
  #    "Not this one  ", "Neither this"] 
lines[lines.index { |l| l.start_with?("Title:") } + 1]
  #=> "this is the text I'd like to extract  "
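
Two things to watch for with the line-based version: split(/\n+/) keeps lines that contain only spaces, and index returns nil when no line starts with "Title:", so the + 1 would raise. A slightly more defensive sketch, assuming it is acceptable to strip each line first (the helper name next_line_after_title is made up for illustration):

```ruby
# Hypothetical helper: returns the first non-blank line after the line that
# starts with "Title:", or nil if there is no such line.
def next_line_after_title(text)
  # Strip each line and drop those that are empty or whitespace-only.
  lines = text.each_line.map(&:strip).reject(&:empty?)
  idx = lines.index { |l| l.start_with?("Title:") }
  # idx is nil when no "Title:" line exists; lines[idx + 1] is nil when
  # "Title:" is the last non-blank line.
  idx && lines[idx + 1]
end

# Sample input with a whitespace-only line between the title and the text.
text = "Title:\n   \nthis is the text I'd like to extract  \nNot this one  \nNeither this"
puts next_line_after_title(text)
```

Because every line is stripped, an indented "Title:" line would also be accepted here; drop the strip on the lookup if that matters for your PDFs.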