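Ruby: How do I split a string on a regex while keeping the delimiters?

Time: 2015-04-21 04:35:17

Tags: ruby regex string split

This has been asked multiple times around here, but it never got a generic answer, so here we go:

Say you have a string, any string, but let's go with "oruh43451rohcs56oweuex59869rsr", and you want to split it with a regular expression. Any regular expression, but let's go with a sequence of digits: /\d+/. Then you'd use split: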

"oruh43451rohcs56oweuex59869rsr".split(/\d+/)
# => ["oruh", "rohcs", "oweuex", "rsr"]

That's lovely, but I want the digits too. For that we have scan:

"oruh43451rohcs56oweuex59869rsr".scan(/\d+/)
# => ["43451", "56", "59869"]

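But I want it all! Is there, say, a split_and_scan? Nope.

How about split and scan, then zip them? Let me stop you right there.

OK, so how?

2 answers:

Answer 0 (score: 5)

If the pattern passed to split contains a capture group, the text matched by that group is included in the result array as well.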

str = "oruh43451rohcs56oweuex59869rsr"
str.split(/(\d+)/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]

If you want each piece paired up (zipped) with the delimiter that follows it,

str.split(/(\d+)/).each_slice(2).to_a
# => [["oruh", "43451"], ["rohcs", "56"], ["oweuex", "59869"], ["rsr"]]
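
To apply the same idea to an arbitrary pattern, the pattern can be wrapped in a capture group by interpolating it into a new regex. This is only a sketch: split_keeping is a made-up helper name, and it assumes the pattern has no capture groups of its own (String#split would add those to the result too). Note also that a delimiter at the very start of the string leaves a leading empty string in the result.

```ruby
# Hypothetical helper (not part of Ruby): split on rx but keep the delimiters.
# Assumes rx contains no capture groups of its own.
def split_keeping(str, rx)
  # Interpolating a Regexp embeds its source, so this builds /((?-mix:...))/.
  str.split(/(#{rx})/)
end

split_keeping("oruh43451rohcs56oweuex59869rsr", /\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]

# A delimiter at the start produces a leading empty string.
split_keeping("1w", /\d+/)
# => ["", "1", "w"]
```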

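Answer 1 (score: 1)

I'm glad you asked... well, there's String#shatter from Facets. I don't love it, because it's implemented using a trick (look at the source, it's a cute, clever trick, but what if your string actually contains a "\1"?).

So I rolled my own. Here's what you get: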

"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]

Here's the implementation:

class Object
  def unfold(&f)
    (m, n = f[self]).nil? ? [] : n.unfold(&f).unshift(m)
  end
end

class String
  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      case
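      when ix.nil?; [s , ""]       # no delimiter left: emit the rest and stop
      when ix == 0; [$&, $']       # delimiter comes first: emit it, continue after it
      when ix >  0; [$`, $& + $']  # text comes first: emit it, continue from the delimiter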
      end
    end
  end
end

(A more verbose version is at the bottom.)
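
In case unfold looks opaque on its own: the block receives the current value and returns either nil (stop) or a pair of the element to emit and the value to continue from. A small sketch on integers, with the definition repeated so that it runs standalone; the countdown and digits examples are illustrations only and not part of the string-splitting solution.

```ruby
class Object
  # Same definition as in the implementation, repeated so this runs on its own.
  def unfold(&f)
    (m, n = f[self]).nil? ? [] : n.unfold(&f).unshift(m)
  end
end

# Countdown: emit n, continue with n - 1, stop at zero.
countdown = 3.unfold { |n| n.zero? ? nil : [n, n - 1] }
# => [3, 2, 1]

# Decimal digits, least significant first: emit n % 10, continue with n / 10.
digits = 1234.unfold { |n| n.zero? ? nil : [n % 10, n / 10] }
# => [4, 3, 2, 1]
```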

Here are some examples of corner cases being handled:

"".unjoin(/\d+/)     # => []
"w".unjoin(/\d+/)    # => ["w"]
"1".unjoin(/\d+/)    # => ["1"]
"w1".unjoin(/\d+/)   # => ["w", "1"]
"1w".unjoin(/\d+/)   # => ["1", "w"]
"1w1".unjoin(/\d+/)  # => ["1", "w", "1"]
"w1w".unjoin(/\d+/)  # => ["w", "1", "w"]

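Two cases the examples above do not exercise. A pattern that can match the empty string (say /\d*/) never shortens the remaining string, so the recursion would not terminate. And because unfold recurses once per piece, an input with many thousands of pieces may exhaust the stack. Below is a minimal loop-based sketch that sidesteps both. unjoin_iter is a made-up name, rejecting empty matches with an ArgumentError is just one possible policy, and since it matches against the whole string from an offset, anchors such as \A or lookbehinds can behave differently from the tail-based version.

```ruby
# Hypothetical loop-based variant of unjoin (name and error policy are assumptions).
def unjoin_iter(str, rx)
  pieces = []
  pos = 0
  while pos < str.length
    m = rx.match(str, pos)          # search from the current offset
    if m.nil?
      pieces << str[pos..-1]        # no delimiter left: the rest is one piece
      break
    end
    if m.begin(0) == m.end(0)
      raise ArgumentError, "pattern matched the empty string at #{m.begin(0)}"
    end
    pieces << str[pos...m.begin(0)] if m.begin(0) > pos  # text before the delimiter
    pieces << m[0]                                        # the delimiter itself
    pos = m.end(0)
  end
  pieces
end

unjoin_iter("oruh43451rohcs56oweuex59869rsr", /\d+/)
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
```
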
That's it, but there's more...

Or, if you're not a fan of monkey-patching built-in classes... well, you could use Refinements... but if you really dislike that too, here it is as plain functions:

def unfold(x, &f)
  (m, n = f[x]).nil? ? [] : unfold(n, &f).unshift(m)
end

def unjoin(s, rx)
  unfold(s) do |s|
    next if s.empty?
    ix = s =~ rx
    case
    when ix.nil?; [s , ""]
    when ix == 0; [$&, $']
    when ix >  0; [$`, $& + $']
    end
  end
end
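
For the Refinements route mentioned above, a rough sketch could look like the following. The module names StringUnjoin and Demo are made up, and the body is a loop-based variant rather than the unfold version, so that Object does not need refining as well. The refinement is activated inside a module body here; at the top of a file, a plain `using StringUnjoin` would activate it for the rest of that file.

```ruby
# Hypothetical refinement: String#unjoin exists only where it is activated.
module StringUnjoin
  refine String do
    def unjoin(rx)
      pieces = []
      rest = self
      until rest.empty?
        m = rx.match(rest)
        if m.nil?
          pieces << rest                  # no delimiter left
          break
        end
        # Guard added for this sketch: an empty match would loop forever.
        raise ArgumentError, "pattern matched the empty string" if m[0].empty?
        pieces << m.pre_match unless m.pre_match.empty?
        pieces << m[0]
        rest = m.post_match
      end
      pieces
    end
  end
end

module Demo
  using StringUnjoin   # unjoin is visible from here to the end of this module body
  PIECES = "oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)
end

Demo::PIECES
# => ["oruh", "43451", "rohcs", "56", "oweuex", "59869", "rsr"]
```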

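I also feel it might not always be clear which pieces are the separators and which are the separated bits, so here's a little addition that lets you ask a string, with #joint?, which role it played before the split: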

class String

  def joint?
    false
  end

  class Joint < String
    def joint?
      true
    end
  end

  def unjoin(rx)
    unfold do |s|
      next if s.empty?
      ix = s =~ rx
      case
      when ix.nil?; [s, ""]
      when ix == 0; [Joint.new($&), $']
      when ix >  0; [$`, $& + $']
      end
    end
  end
end

Here it is in use:

"oruh43451rohcs56oweuex59869rsr".unjoin(/\d+/)\
  .map { |s| s.joint? ? "(#{s})" : s }.join(" ")
# => "oruh (43451) rohcs (56) oweuex (59869) rsr"
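
A possible alternative that avoids subclassing String: when split is given a pattern wrapped in a single capture group, text and delimiters alternate in the result, so the index tells them apart. This sketch assumes the pattern has no other capture groups and never matches the empty string. A delimiter at the very start yields a leading empty string at index 0, which keeps the alternation intact.

```ruby
str = "oruh43451rohcs56oweuex59869rsr"

pieces = str.split(/(\d+)/)
# Even indices hold the text between delimiters, odd indices the captured delimiters.
marked = pieces.each_with_index.map { |piece, i| i.odd? ? "(#{piece})" : piece }

marked.join(" ")
# => "oruh (43451) rohcs (56) oweuex (59869) rsr"
```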

You can now easily reimplement split and scan:

class String

  def split2(rx)
    unjoin(rx).reject(&:joint?)
  end

  def scan2(rx)
    unjoin(rx).select(&:joint?)
  end

end

"oruh43451rohcs56oweuex59869rsr".split2(/\d+/)
# => ["oruh", "rohcs", "oweuex", "rsr"]

"oruh43451rohcs56oweuex59869rsr".scan2(/\d+/)
# => ["43451", "56", "59869"]

And if you hate the regex match globals and terseness in general, here is the more verbose version:

class Object
  def unfold(&map_and_next)
    result = map_and_next.call(self)
    return [] if result.nil?
    mapped_value, next_value = result
    [mapped_value] + next_value.unfold(&map_and_next)
  end
end

class String
  def unjoin(regex)
    unfold do |tail_string|
      next if tail_string.empty?
      match = tail_string.match(regex)
      index = match && match.begin(0)  # match is nil when no delimiter is left
      case
      when index.nil?; [tail_string, ""]
      when index == 0; [match.to_s, match.post_match]
      when index >  0; [match.pre_match, match.to_s + match.post_match]
      end
    end
  end
end