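I have a text file with a sectioned structure, and I want to split it into an array of strings, one element per section. I will then process each section's content differently depending on which section it is. I'm working in irb right now and will most likely move this into a standalone Ruby script.
I've created both a string object and a file object from the input file ("sample" and "sample_file", respectively) so I can test different approaches. I'm sure a file-reading loop could work here, but I believe a simple match is all I need.
The file looks like this: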
*** Section Header ***
randomly formatted content
multiple lines
*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)
This sections info
**** sub headers sometime occur***
I'm okay with treating this as normal headers for now.
I think sub headers may have something consistent about them.
*** Header ***
info for this section
Example output:
[*** Section Header ***\r\n\r\n randomly formatted content\r multiple lines, **** Another Header\r this sections info,*** sub header and its info, ...etc.]
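That is: [string of section, string of section, string of section]. Most of my attempts have failed because of complications from the inconsistent opening and closing conditions, or from the multi-line matching I need.
Below are my closest attempts. They either create unwanted elements (such as a string containing the closing asterisks of one header and the opening asterisks of the next) or grab only the headers.
This matches the headers: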
sample.scan(/\*{3}.*/)
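This matches headers and sections, but it also creates elements from the closing and opening asterisks. I don't fully understand lookaround assertions, but based on my searching I think the solution will look something like this: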
sample.scan(/(?<=\*{3}).*?(?=\*{3})/m)
Right now I'm trying to match lines that start with whitespace or asterisks, but it isn't there yet!
sample.scan(/^(\s+\*+|\*+).*/)
Any direction is greatly appreciated.
Answer 0 (score: 3)
Ruby's Enumerable includes slice_before, which is very useful for this kind of task:
str = "*** Section Header ***
randomly formatted content
multiple lines
*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)
This sections info
**** sub headers sometime occur***
I'm okay with treating this as normal headers for now.
I think sub headers may have something consistent about them.
*** Header ***
info for this section
"
str.split("\n").slice_before(/^\s*\*{3}/).to_a
# => [["*** Section Header ***",
# "",
# "randomly formatted content",
# "multiple lines",
# ""],
# [" *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set)",
# "",
# "This sections info"],
# [" **** sub headers sometime occur***",
# " I'm okay with treating this as normal headers for now.",
# " I think sub headers may have something consistent about them.",
# "",
# ""],
# ["*** Header ***", " info for this section"]]
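Using slice_before lets me use a very simple pattern to locate a landmark that indicates where the sub-arrays should break. The pattern /^\s*\*{3}/ finds lines that begin with an optional run of whitespace followed by three '*'. Each time such a line is found, a new sub-array starts.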
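To see the mechanics in isolation, here is a minimal sketch on a made-up array (the "# " marker and the variable names are only illustrative, not from the question's file):

```ruby
# Hypothetical toy input: a leading "#" marks the start of each group.
lines = ["# one", "a", "b", "# two", "c"]

# slice_before starts a new sub-array at every element that matches the pattern.
groups = lines.slice_before(/^#/).to_a

p groups
# => [["# one", "a", "b"], ["# two", "c"]]
```

The matching element always becomes the first element of its sub-array, which is why each header ends up at the front of its section.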
If you want each sub-array to be a single string instead of an array of the lines in the chunk, map(&:join) is your friend:
str.split("\n").slice_before(/^\s*\*{3}/).map(&:join)
# => ["*** Section Header *** randomly formatted content multiple lines",
# " *** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# " **** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# " *** Header *** info for this section "]
And if you want to remove leading and trailing whitespace, you can combine strip with map:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip }
# => ["*** Section Header *** randomly formatted content multiple lines",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# "*** Header *** info for this section"]
Or:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.map(&:strip).join(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines ",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them. ",
# "*** Header *** info for this section "]
Or:
str.split("\n").slice_before(/^\s*\*{3}/).map{ |sa| sa.join.strip.squeeze(' ') }
# => ["*** Section Header *** randomly formatted content multiple lines",
# "*** Another Header (some don't end with asterisk and sometimes space will exist before the asterisk set) This sections info",
# "**** sub headers sometime occur*** I'm okay with treating this as normal headers for now. I think sub headers may have something consistent about them.",
# "*** Header *** info for this section"]
It depends on what you want to do.
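One more variation, as a sketch: join with no argument fuses the lines together, so if the line structure inside each section matters, passing an explicit separator keeps it. The input here is a simplified, hypothetical stand-in for the real file:

```ruby
# Hypothetical, already-split lines standing in for the real file.
lines = ["*** A ***", "foo", "bar", "*** B ***", "baz"]

# Join each chunk with "\n" so the lines of a section stay distinguishable.
sections = lines.slice_before(/^\s*\*{3}/).map { |chunk| chunk.join("\n") }

p sections
# => ["*** A ***\nfoo\nbar", "*** B ***\nbaz"]
```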
Splitting on "\r" produced better output on my real file than "\n".
str.split(/\r?\n/).slice_before(/^\s*\*{3}/).to_a
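Use /\r?\n/, a regular expression that matches an optional carriage return followed by a newline. Windows uses the "\r\n" combination to mark the end of a line, while Mac OS and *nix use only "\n". Splitting this way keeps your code from being tied to Windows only.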
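A small sketch of the difference, using two made-up strings that differ only in their line endings:

```ruby
# Hypothetical inputs: the same content with Windows and Unix line endings.
windows_text = "*** A ***\r\nfoo\r\n*** B ***\r\nbar\r\n"
unix_text    = "*** A ***\nfoo\n*** B ***\nbar\n"

# /\r?\n/ consumes the optional "\r", so both inputs give the same lines.
windows_lines = windows_text.split(/\r?\n/)
unix_lines    = unix_text.split(/\r?\n/)

# Splitting on "\n" alone leaves a stray "\r" at the end of every line.
naive_lines = windows_text.split("\n")

p windows_lines == unix_lines # => true
p naive_lines.first           # => "*** A ***\r"
```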
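I don't know whether slice_before was developed for this specific use, but I've used it to tear apart text files and break them into paragraphs, and to split network-device configurations into chunks, which made parsing much easier in both cases.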
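Since the goal is to treat each section differently, one possible next step is to separate each chunk into its header and body. This is only a sketch: the sample text is simplified, and the :title and :body keys are names chosen for illustration.

```ruby
# Hypothetical, simplified sample text.
text = "*** Section Header ***\ncontent\n*** Another Header\nmore\n"

# The first line of each chunk is the header; the rest is the body.
sections = text.split(/\r?\n/).slice_before(/^\s*\*{3}/).map do |header, *body|
  { title: header.delete('*').strip, body: body }
end

# Index the bodies by title so each section can be looked up and handled separately.
by_title = sections.map { |s| [s[:title], s[:body]] }.to_h

p by_title
# => {"Section Header"=>["content"], "Another Header"=>["more"]}
```

From there a case statement on the title, or a lookup in by_title, can decide what to do with each section.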
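Answer 1 (score: 1)
There are many ways to accomplish what you want to do, but if you want to use a regular expression, a pattern like this might work (depending on the exact text, you may need to tweak it slightly):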
(.*[*].*.+[^*]*)
Example:
http://regex101.com/r/aU0xU1/2
Code:
http://ideone.com/oMsb50
About the pattern (.*[*].*.+[^*]*):
.* matches any character (except newline)
(Between zero and unlimited times), [greedy]
[*] matches an asterisk, the literal character *
.* matches any character (except newline)
(Between zero and unlimited times), [greedy]
.+ matches any character (except newline)
(Between one and unlimited times), [greedy]
[^*]* matches anything except an asterisk (including newlines)
(Between zero and unlimited times), [greedy]
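As a rough check of how the pattern behaves with scan, here is a sketch on a small made-up string (not the question's actual file):

```ruby
# Hypothetical sample text, much simpler than the real file.
text = "*** A ***\nfoo\nbar\n*** B ***\nbaz\n"

# With a capture group, scan returns one array per match,
# so flatten yields a plain array of section strings.
sections = text.scan(/(.*[*].*.+[^*]*)/).flatten

p sections
# => ["*** A ***\nfoo\nbar\n", "*** B ***\nbaz\n"]
```

Note that [^*]* stops at any asterisk, so an asterisk inside a section's body would cut that section short at that point.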
Answer 2 (score: 0)
Answer 3 (score: 0)
A more readable idea might be to use a lookahead to split just before the pattern:
str.split(/(?=\n *\*{3})/)
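A quick sketch of what this produces, on a small made-up string. Because the lookahead is zero-width, nothing is consumed, and every section after the first keeps its leading newline; the strip step below is an optional extra, not part of the original suggestion.

```ruby
# Hypothetical sample; the second header has a leading space, as in the question.
text = "*** A ***\nfoo\n *** B\nbar\n"

# Split at each position that is followed by a newline, optional spaces, and "***".
sections = text.split(/(?=\n *\*{3})/)

p sections
# => ["*** A ***\nfoo", "\n *** B\nbar\n"]

# Optional tidy-up: remove the surrounding whitespace from each section.
tidy = sections.map(&:strip)

p tidy
# => ["*** A ***\nfoo", "*** B\nbar"]
```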