Ruby ARGF & RegEx: How to split on the paragraph break "\r\n" rather than the end-of-line "\r\n"

Time: 2015-01-10 13:27:11

Tags: ruby regex hadoop-streaming

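I am trying to use a regular expression in Ruby to preprocess some text that feeds a mapper job, and I would like to split on the carriage return that marks a paragraph break.

The text comes into the mapper via ARGF.each as part of a Hadoop streaming job: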

"\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
"\r\n"    # <----- this is where I would like to split
"Precisely such had the paragraph originally stood from the printer's\r\n"

Once that is done, I will chomp the newline/carriage return from each line.

It would look something like this:

ARGF.each do |text|

  paragraph = text.split(INSERT_REGEX_HERE)

  #some more blah will happen beyond here
end

Update:

The desired output would then be an array like the following:

[
  [0]  "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
    "daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
    "Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
    "June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
    "1789\"\r\n"
  [1] "Precisely such had the paragraph originally stood from the printer's\r\n"
]

Ultimately, what I want is the following array, with no carriage returns in it:

[
  [0]  "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,"
    "daughter of James Stevenson, Esq. of South Park, in the county of"
    "Gloucester, by which lady (who died 1800) he has issue Elizabeth, born"
    "June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,"
    "1789\""
  [1] "Precisely such had the paragraph originally stood from the printer's"
]

Thanks in advance for any insight.

2 Answers:

Answer 0: (score: 1)

Be careful: when you write ARGF.each do |text|, text is a single line each time, not the whole block of text.
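A minimal sketch of that behaviour, assuming a StringIO object as a stand-in for ARGF (so it runs without piped input) and made-up sample text:

```ruby
require "stringio"

# StringIO stands in for ARGF here; the sample text is illustrative only.
input = StringIO.new("first line\r\nsecond line\r\n\r\nnext paragraph\r\n")

# With the default separator, each element is one line, terminator included.
# The blank line between paragraphs arrives as its own "\r\n" element.
lines = input.each_line.to_a

lines.each { |line| puts line.inspect }
```

Because the block only ever sees one line, a split on a paragraph pattern inside it has nothing to match across lines.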

You can pass ARGF.each a custom line separator, and it will then return two "lines", which in your case are the two paragraphs.

Try this:

paragraphs = ARGF.each("\r\n\r\n").map{|p| p.gsub("\r\n","")}

This first splits the input into the two paragraphs, then uses gsub to remove the unwanted line breaks.
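The following sketch shows what that approach produces, again assuming a StringIO stand-in for ARGF and illustrative sample text. It also includes a variant (not part of the answer above) that splits each paragraph on "\r\n" instead of deleting it, in case the individual lines should stay separate:

```ruby
require "stringio"

# StringIO stands in for ARGF so the sketch runs without piped input.
input = StringIO.new("line one,\r\nline two\r\n\r\nnext paragraph\r\n")

# Passing "\r\n\r\n" as the separator yields one chunk per paragraph.
chunks = input.each_line("\r\n\r\n").to_a

# As in the answer: deleting "\r\n" joins the lines with no space between them.
joined = chunks.map { |chunk| chunk.gsub("\r\n", "") }

# Variant (an assumption about the desired shape): keep the lines separate.
# String#split drops trailing empty strings, so no blank entries remain.
nested = chunks.map { |chunk| chunk.split("\r\n") }

puts joined.inspect
puts nested.inspect
```

Note that a fixed "\r\n\r\n" separator assumes exactly one blank line between paragraphs; extra blank lines would likely show up as empty chunks.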

Answer 1: (score: 0)

To split the text, use:

result = text.gsub(/(?<!\")\\r\\n|(?<=\\\")\\r\\n/, '').split(/[\r\n]+\"\\r\\n\".*?[\r\n]+/)
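As a simpler alternative sketch (not the expression above), and assuming the whole input is already held in one string containing real "\r\n" characters, a run of two or more consecutive "\r\n" can serve as the paragraph delimiter. The sample text and variable names are illustrative; with ARGF the string could come from ARGF.read, at the cost of holding the entire input in memory:

```ruby
# Sample text with real CRLF line endings; illustrative only.
text = "line one,\r\nline two\r\n\r\nnext paragraph\r\n"

# Two or more consecutive "\r\n" mark a paragraph break.
paragraphs = text.split(/(?:\r\n){2,}/)

# Split each paragraph into its lines; the terminators are discarded.
lines_by_paragraph = paragraphs.map { |paragraph| paragraph.split("\r\n") }

puts lines_by_paragraph.inspect
```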