Ruby: How do I split a string while keeping the delimiter, when the delimiter is longer than one character?

Asked: 2013-12-03 22:47:48

Tags: ruby regex

Earlier related questions only cover delimiters of length 1.

What I want is the following (as an example):

str = 'Hello: Alice Hello: Bob Hello: Charlie Hello: David'
arr = str.magic_split('Hello:')

=> arr[0] = 'Hello: Alice '
   arr[1] = 'Hello: Bob '
   arr[2] = 'Hello: Charlie '
   arr[3] = 'Hello: David'

I tried str.scan(/ Hello:/), but I can't figure out how to tweak the regex to make it work. Thanks a lot.

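I see that some answers only work for this particular case. Let me be more specific.

The file I want to split looks like the following, and the delimiter is "Certificate:"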

Certificate:
    Data: ...
    Signature Algorithm: ...
...
-----BEGIN CERTIFICATE-----
F19ibG6uZyBJbmR1c3RyaWVzIEluYzESMBAGA1UECwwJTWV6emFuaW5lMRMwEQYD\n
2O2RV6HR84N2/A5ZPRF8AQMXJCLIR4qMe/d97/1XK6JQQLUI5NaNroUkW3+tjXo/\n
ovl3vom6xOwUUcFDdv2QoCYBVADX7W2RaVP0JGfiDcekOTwtdos/tmsblboR8oEp\n
fbxD45AowT+khXnPDCQWWpslXJoKMBkaWH7ajb+yKfEYGzRPEmq+v/FPMY9mlJhX\n
epciB5FNO5krO+cyhky5tBZTIv7qCu3kc36dcQXIOTakc7CdoVgwLnytebwTqtpG\n
KuLLH46U8Pp3eeiDDBxYJlz6a2bsbtOaKb1CKMFB3x8LLfLbF4Sh+ScDHetkJDh5\n
...
Certificate:
...
Certificate:
...

Basically, there will be arbitrary ASCII characters between the "Certificate:" markers.

Thanks again.

6 answers:

Answer 0 (score: 5)

Try this regex:

(Hello:\s+.+?)(?=Hello:|$)

Demo:

http://rubular.com/r/l5WD6A1a2r
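As a rough sketch of how this pattern might be used from Ruby (the variable name parts is only illustrative): because the pattern contains a capture group, String#scan returns one single-element array per match, so the result is flattened here.

```ruby
str = 'Hello: Alice Hello: Bob Hello: Charlie Hello: David'

# scan yields [[match], [match], ...] because the pattern has one capture group,
# so flatten turns that into a plain array of strings
parts = str.scan(/(Hello:\s+.+?)(?=Hello:|$)/).flatten

p parts
# => ["Hello: Alice ", "Hello: Bob ", "Hello: Charlie ", "Hello: David"]
```

Note that in Ruby `.` does not match a newline by default and `$` matches at the end of every line, so for the multi-line certificate file this pattern would presumably need the /m flag and \z in place of $.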

Answer 1 (score: 4)

> str = 'Hello: Alice Hello: Bob Hello: Charlie Hello: David'
 => "Hello: Alice Hello: Bob Hello: Charlie Hello: David"
> str.scan(/Hello: \w+\b/)
 => ["Hello: Alice", "Hello: Bob", "Hello: Charlie", "Hello: David"]

This relies heavily on the names consisting only of word characters (letters, digits, underscore), but it does fit your case.
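To illustrate that caveat, here is a small sketch with a made-up input (the hyphenated name is an assumption, not from the question). The match stops at the first character that is not a word character:

```ruby
# Hypothetical input containing a hyphenated name
str = 'Hello: Mary-Jane Hello: Bob'

# \w+ stops at the hyphen, so the rest of the name is lost
result = str.scan(/Hello: \w+\b/)

p result
# => ["Hello: Mary", "Hello: Bob"]
```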

Answer 2 (score: 4)

This is a common use case for slice_before:

text = "Certificate:
    Data: ...
    Signature Algorithm: ...
...
-----BEGIN CERTIFICATE-----
F19ibG6uZyBJbmR1c3RyaWVzIEluYzESMBAGA1UECwwJTWV6emFuaW5lMRMwEQYD
2O2RV6HR84N2/A5ZPRF8AQMXJCLIR4qMe/d97/1XK6JQQLUI5NaNroUkW3+tjXo/
ovl3vom6xOwUUcFDdv2QoCYBVADX7W2RaVP0JGfiDcekOTwtdos/tmsblboR8oEp
fbxD45AowT+khXnPDCQWWpslXJoKMBkaWH7ajb+yKfEYGzRPEmq+v/FPMY9mlJhX
epciB5FNO5krO+cyhky5tBZTIv7qCu3kc36dcQXIOTakc7CdoVgwLnytebwTqtpG
KuLLH46U8Pp3eeiDDBxYJlz6a2bsbtOaKb1CKMFB3x8LLfLbF4Sh+ScDHetkJDh5
...
Certificate:
...
Certificate:
...
"

certificates = text.lines.slice_before(/^Certificate/).to_a
# => [["Certificate:\n",
#      "    Data: ...\n",
#      "    Signature Algorithm: ...\n",
#      "...\n",
#      "-----BEGIN CERTIFICATE-----\n",
#      "F19ibG6uZyBJbmR1c3RyaWVzIEluYzESMBAGA1UECwwJTWV6emFuaW5lMRMwEQYD\n",
#      "2O2RV6HR84N2/A5ZPRF8AQMXJCLIR4qMe/d97/1XK6JQQLUI5NaNroUkW3+tjXo/\n",
#      "ovl3vom6xOwUUcFDdv2QoCYBVADX7W2RaVP0JGfiDcekOTwtdos/tmsblboR8oEp\n",
#      "fbxD45AowT+khXnPDCQWWpslXJoKMBkaWH7ajb+yKfEYGzRPEmq+v/FPMY9mlJhX\n",
#      "epciB5FNO5krO+cyhky5tBZTIv7qCu3kc36dcQXIOTakc7CdoVgwLnytebwTqtpG\n",
#      "KuLLH46U8Pp3eeiDDBxYJlz6a2bsbtOaKb1CKMFB3x8LLfLbF4Sh+ScDHetkJDh5\n",
#      "...\n"],
#     ["Certificate:\n", "...\n"],
#     ["Certificate:\n", "...\n"]]

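slice_before walks through an array looking for lines that match the pattern. Each time it finds one, it starts a new sub-array with that line at the front and continues to the next match. In the output above you can see a separate sub-array for each certificate.

It's a very useful method.

If, after slicing, you want the encoded certificate, extract just those lines, since they should sit at a fixed offset: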

certificates.first[5 .. 10]
# => ["F19ibG6uZyBJbmR1c3RyaWVzIEluYzESMBAGA1UECwwJTWV6emFuaW5lMRMwEQYD\n",
#     "2O2RV6HR84N2/A5ZPRF8AQMXJCLIR4qMe/d97/1XK6JQQLUI5NaNroUkW3+tjXo/\n",
#     "ovl3vom6xOwUUcFDdv2QoCYBVADX7W2RaVP0JGfiDcekOTwtdos/tmsblboR8oEp\n",
#     "fbxD45AowT+khXnPDCQWWpslXJoKMBkaWH7ajb+yKfEYGzRPEmq+v/FPMY9mlJhX\n",
#     "epciB5FNO5krO+cyhky5tBZTIv7qCu3kc36dcQXIOTakc7CdoVgwLnytebwTqtpG\n",
#     "KuLLH46U8Pp3eeiDDBxYJlz6a2bsbtOaKb1CKMFB3x8LLfLbF4Sh+ScDHetkJDh5\n"]
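If one string per certificate is wanted, as in the question, rather than an array of lines, the sub-arrays could be joined back together. A minimal sketch, assuming a shortened stand-in for the certificate file:

```ruby
# Shortened stand-in for the certificate file in the question
text = "Certificate:\n    Data: ...\nCertificate:\n...\n"

# Each sub-array of lines is joined back into one string per certificate
certificates = text.lines.slice_before(/^Certificate:/).map(&:join)

p certificates
# => ["Certificate:\n    Data: ...\n", "Certificate:\n...\n"]
```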

Answer 3 (score: 2)

There are many ways...

 str = 'Hello: Alice Hello: Bob Hello: Charlie Hello: David'
 str.split("Hello:")[1..-1].map {|s| "Hello:"+s}

 str.split(/(Hello:)/)[1..-1].each_slice(2).map(&:join)

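Note that the latter approach uses a regex in which the string "Hello:" is wrapped in a capture group, so split keeps the delimiters in its result: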

 str.split(/(Hello:)/)
   #=> ["", "Hello:", " Alice ", "Hello:", " Bob ",
   #    "Hello:", " Charlie ", "Hello:", " David"] 

whereas:

 str.split(/Hello:/)
   #=> ["", " Alice ", " Bob ", " Charlie ", " David"]
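The capture-group approach could be wrapped in a small helper that accepts any delimiter string. This is only a sketch: split_keeping is a hypothetical name, not a Ruby method, and Regexp.escape is used on the assumption that the delimiter may contain regex metacharacters.

```ruby
# Hypothetical helper; the name is an assumption, not part of Ruby
def split_keeping(str, delimiter)
  # Regexp.escape guards against delimiters containing regex metacharacters;
  # the capture group makes split keep the delimiters in its result
  pieces = str.split(/(#{Regexp.escape(delimiter)})/)
  # pieces[0] is whatever precedes the first delimiter ("" if str starts with it),
  # so it is dropped before pairing each delimiter with the text that follows it
  pieces.drop(1).each_slice(2).map(&:join)
end

str = 'Hello: Alice Hello: Bob Hello: Charlie Hello: David'
p split_keeping(str, 'Hello:')
# => ["Hello: Alice ", "Hello: Bob ", "Hello: Charlie ", "Hello: David"]
```

Since split works on the whole string rather than line by line, the same helper should also handle the multi-line certificate text with 'Certificate:' as the delimiter.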

Answer 4 (score: 1)

Not sure whether this works for your specific case, but you could try:

splitta = "Hello: "
str.split(splitta).drop(1).map { |s| splitta + s }

which returns

=> ["Hello: Alice ", "Hello: Bob ", "Hello: Charlie ", "Hello: David"]

Answer 5 (score: 0)

Try this pattern: (Hello:\s*(?:(?:(?!Hello:).)*))
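A sketch of how this pattern might be applied in Ruby. The second half swaps in "Certificate:" as the delimiter and adds the /m flag so that `.` also matches newlines; both of those are adaptations for the multi-line case, not part of the pattern as given, and the sample text is a shortened stand-in.

```ruby
str = 'Hello: Alice Hello: Bob Hello: Charlie Hello: David'

# The negative lookahead (?!Hello:) stops each match just before the next delimiter;
# flatten is needed because the pattern has one capture group
hello_parts = str.scan(/(Hello:\s*(?:(?:(?!Hello:).)*))/).flatten
p hello_parts
# => ["Hello: Alice ", "Hello: Bob ", "Hello: Charlie ", "Hello: David"]

# Shortened stand-in for the certificate file in the question
text = "Certificate:\n    Data: ...\nCertificate:\n...\n"

# Same pattern with the delimiter swapped; /m lets . match across line breaks
cert_parts = text.scan(/(Certificate:\s*(?:(?:(?!Certificate:).)*))/m).flatten
p cert_parts
# => ["Certificate:\n    Data: ...\n", "Certificate:\n...\n"]
```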