Parsing multiline text with a pattern

Time: 2017-02-12 13:27:39

Tags: ruby regex string split

Here is a small example:

02-09-17 1:01 PM - Some User (Add comments)
Hello,

How are you?

Regards,

02-09-17 3:29 PM - Another User (Add comments)
Hey,

Thanks, all is fine.

Some another text here.

02-09-17 4:30 AM - Just a User (Add comments)
some text
with
multiline

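I want to parse and process these three comments. What is the best way to do it?

I tried a regexp like this: http://www.rubular.com/r/k1CHJ1STTD, but I have a problem with the /m flag. Without the regexp's multiline flag, I can't capture the "body" of a comment.

I also tried splitting by a regexp: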

text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/)
=> ["",
"02-09-17 1:01 PM - Some User (Add comments)",
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n",
"02-09-17 3:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text     here.\n" + "\n",
"02-09-17 4:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n",
"02-09-17 5:29 PM - Another User (Add comments)",
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n",
"02-09-17 6:30 AM - Just a User (Add comments)",
"\n" + "some text\n" + "with\n" + "multiline\n"]

But this is not a comfortable solution.

Ideally, I'd like a regexp that captures each comment in three (or two) groups, for example:

1. 02-09-17 1:01 PM
2. Some User (Add comments)
3. Hello,

How are you?

Regards,

for each comment, or alternatively an array of comments:

[['02-09-17 1:01 PM - Some User (Add comments) Hello,

How are you?

Regards,'],[...]]

Any ideas? Thanks.

3 answers:

Answer 0 (score: 2)

You can keep it simple by using two splits (one for the whole string, one for each block):

text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }
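To make the result concrete, here is a minimal sketch that runs the two-step split on the sample from the question. The variable names `text` and `comments` are assumptions for illustration, and the input is stripped first so that the last body carries no trailing newline.

```ruby
# Sample input from the question; <<~ removes the common indentation.
text = <<~TEXT.strip
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,

  Thanks, all is fine.

  Some another text here.

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# Outer split: cut at a blank line that is followed by "dd-" (start of a header).
# Inner split: at most 3 pieces, so the body is left untouched after the header.
comments = text.split(/\n\n(?=\d\d-)/).map { |block| block.split(/ - |\n/, 3) }

comments.each do |time, user, body|
  puts "#{time} | #{user} | #{body.inspect}"
end
```

The limit of 3 in the inner split is what keeps the body in one piece. Note that this sketch relies on the header containing exactly one ` - `; a user name that itself contains ` - ` would be cut at that point.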

You can also use the scan method, but it is more finicky:

text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)
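For comparison, here is a sketch of the `scan` variant on a shortened two-comment sample, assuming the input has been stripped of trailing whitespace (otherwise the last body keeps its final newline). `scan` returns one array of three captures per match; the names `text` and `rows` are illustrative.

```ruby
text = <<~TEXT.strip
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# Group 1: date and time, group 2: rest of the header line,
# group 3: body lines up to the next "blank line + dd-" or the end of the string.
rows = text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)

rows.each { |time, user, body| p [time, user, body] }
```

The lazy `*?` after the atomic group adds one line at a time to the body and stops as soon as the lookahead sees the next header or `\z`.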

Answer 1 (score: 1)

slice_before may be easier to understand than a huge scan, and it has the advantage of keeping the pattern (split removes it):

data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block|
  time, user = block.shift.strip.split(' - ')
  [time, user, block.join.strip]
end

p data
# [["02-09-17 1:01 PM",
#   "Some User (Add comments)",
#   "Hello,\n\nHow are you?\n\nRegards,"],
#  ["02-09-17 3:29 PM",
#   "Another User (Add comments)",
#   "Hey,\n\nThanks, all is fine.\n\nSome another text here."],
#  ["02-09-17 4:30 AM",
#   "Just a User (Add comments)",
#   "some text\nwith\nmultiline"]]

Answer 2 (score: 0)

You can use this regex:

(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$)
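  • (\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) matches the first group, the date and time. The date must consist of three two-digit numbers separated by dashes, followed by a time with AM or PM.
  • (.*?)\r?\n((?:.|\r?\n)+?) matches the user name, up to the first newline (\r?\n), as the second group. After that, anything, including newlines, is matched and forms the third group, the comment.
  • On its own this does not work, because it would treat everything from the start of the comment to the end of the file as the comment. So you need to look for the next date/time so that the match stops there. You could do this by repeating the date/time pattern after the comment and matching non-greedily, but then the next date/time would be consumed by the current match and would no longer be available to the next one (so every second comment would be skipped). To avoid this, use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). It checks that the next date/time follows without including it in the match. The last comment has to end at the end of the string, $.
  • You need the global flag /g, but you must not use the multiline flag /m: a comment spans several lines, and with /m the $ would match at the end of every line instead of only at the end of the string.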
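The pattern above is written for regex101's PCRE flavor. Since the question is tagged ruby, here is a hedged sketch of one way to adapt it. Ruby has no /g flag, so `String#scan` collects all matches, and because `$` in Ruby matches at the end of every line, the final alternative is written as `\z` (end of string) instead. The names `comment_re` and `parsed` are assumptions for illustration. The bodies are stripped because, with this sample, the lookahead leaves a trailing newline in the third capture.

```ruby
text = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,

  Thanks, all is fine.

  Some another text here.

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# Same pattern as in the answer, except that the trailing `$` is replaced
# by `\z` so the last comment runs to the end of the string in Ruby.
comment_re = /(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |\z)/

# scan yields [time, user, body] per comment; strip removes surrounding newlines.
parsed = text.scan(comment_re).map { |time, user, body| [time, user, body.strip] }

parsed.each { |comment| p comment }
```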

Here is a live example: https://regex101.com/r/o63GQE/2