Ruby has no "String#substrings_between(start, end)"; what should I use instead?

Date: 2010-07-09 12:26:19

Tags: ruby substring

I have a very complex string, for example:

<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
....

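Now I want to extract the aaa, bbb and ccc parts. I don't want to use a regular expression here, because converting the <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> part into a regex is too complicated.

I wish there were a method (say, substrings_between) that I could use like this: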

substrings = text.substrings_between('<p>', ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>');
substrings # -> [aaa, bbb, ccc]

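Is there such a method? If not, what is the best way to do this?

5 answers:

Answer 0 (score: 4)

Ideally, you should parse HTML with a proper parser such as Nokogiri.

That said, if you are sure that all you need is the text between two hard-coded strings, you can use scan with a regex and let Regexp.escape handle the special characters: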

string = '<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
          <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
          <p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'

before = Regexp.escape '<p>'
after  = Regexp.escape ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'

substrings = string.scan(/#{before}(.*?)#{after}/).flatten
# => ["aaa", "bbb", "ccc"]
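
The two steps above (escape both delimiters, then scan with a lazy capture group) can be wrapped in a helper shaped like the one the question asked for. This is only a sketch: substrings_between is a hypothetical name taken from the question, not a method that exists in Ruby, and the sample text is shortened to two lines for illustration.

```ruby
# Hypothetical helper named after the method the question wished for.
# It is NOT part of Ruby's core or standard library.
def substrings_between(text, before, after)
  # Regexp.escape makes both delimiters match literally, so their
  # punctuation is never interpreted as regex syntax.
  pattern = /#{Regexp.escape(before)}(.*?)#{Regexp.escape(after)}/
  # With a single capture group, scan returns an array of one-element
  # arrays; flatten turns that into a flat array of strings.
  text.scan(pattern).flatten
end

html = <<~HTML
  <p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
  <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
HTML

result = substrings_between(
  html,
  '<p>',
  ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'
)
p result
```

Because the pattern has no /m flag, a match cannot span a line break, which suits input where each <p> element sits on its own line.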

Answer 1 (score: 2)

The following method will do the job:

def substring_between(target, match1, match2)
  start_match1 = target.index(match1)
  if start_match1 && start_match2 = target.index(match2, start_match1 + match1.length)
    start_idx = start_match1 + match1.length
    target[start_idx, start_match2 - start_idx]
  else
    nil
  end
end

If you would rather have it as an instance method on the String class, this should work for you:

class String
  def substring_between(sub1, sub2)
    match1 = self.index(sub1)
    if match1 && match2 = self.index(sub2, match1 + sub1.length)
      idx = match1 + sub1.length
      self[idx, match2 - idx]
    else
      nil
    end
  end
end

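Both implementations return nil if the start or end marker is missing, or if the markers appear in the wrong order. The following test script and its output show that they work as expected: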

strings = [
'No tags at all',
'<font End tag before start tag <p>',
'<p>End tag at end <font',
'No start tag <font',
'<p>No end tag',
'<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>',
'    <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>',
'<p>ccc     cccc<font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'
]

strings.each do |s|
  puts "Method Test = #{s} Result: |#{substring_between(s, '<p>', '<font')}|"
  puts "String Test = #{s} Result: |#{s.substring_between('<p>', '<font')}|"
end

Output:

Method Test = No tags at all Result: ||
String Test = No tags at all Result: ||
Method Test = <font End tag before start tag <p> Result: ||
String Test = <font End tag before start tag <p> Result: ||
Method Test = <p>End tag at end <font Result: |End tag at end |
String Test = <p>End tag at end <font Result: |End tag at end |
Method Test = No start tag <font Result: ||
String Test = No start tag <font Result: ||
Method Test = <p>No end tag Result: ||
String Test = <p>No end tag Result: ||
Method Test = <p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |aaa |
String Test = <p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |aaa |
Method Test =     <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |bbb |
String Test =     <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |bbb |
Method Test = <p>ccc     cccc<font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |ccc     cccc|
String Test = <p>ccc     cccc<font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |ccc     cccc|
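
Note that substring_between returns only the first match in a string, whereas the question wants every match. One possible way to bridge that gap is to call it once per line. The sketch below assumes each <p> element sits on its own line; the sample text and the variable names text and found are made up for illustration.

```ruby
# Condensed version of substring_between from this answer: returns nil when
# either marker is missing or the end marker does not follow the start marker.
def substring_between(target, match1, match2)
  start_idx = target.index(match1)
  return nil unless start_idx

  start_idx += match1.length
  end_idx = target.index(match2, start_idx)
  end_idx && target[start_idx, end_idx - start_idx]
end

text = "<p>aaa <font>x</p>\nno markers here\n<p>bbb <font>y</p>\n"

# One lookup per line. compact drops the nil produced by the line without
# markers, and strip removes the space left in front of "<font".
found = text.each_line
            .map { |line| substring_between(line, '<p>', '<font') }
            .compact
            .map(&:strip)
p found
```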

Answer 2 (score: 1)

Use strip_tags (a Rails view helper, not part of core Ruby):

string = '<span id="span_is"><br><br><u><i>Hi</i></u></span>'
strip_tags(string)  # Will Return  'Hi'

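Answer 3 (score: 1)

I think you will have to build this functionality yourself. Something like:

def substrings_between str, opening, ending
  i_opening = str.index opening
  # Look for the ending delimiter only after the opening one.
  i_ending = i_opening && str.index(ending, i_opening + opening.length)
  res = []
  while i_opening && i_ending
    # Exclusive range (...) so the ending delimiter's first character is left out.
    res << str[i_opening + opening.length...i_ending]
    # Continue with the rest of the string, after the ending delimiter.
    str = str[i_ending + ending.length..-1]
    i_opening = str.index opening
    i_ending = i_opening && str.index(ending, i_opening + opening.length)
  end
  res
end

(This code is not very Ruby-like, but it works.)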

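Answer 4 (score: 1)

I think the function you are looking for is probably too specific to be included in the Ruby distribution.

We can assemble it ourselves using

String#index(string, offset)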
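
As a quick illustration of why the offset argument matters: String#index returns the position of the first occurrence at or after the offset, or nil if there is none, so repeated calls can walk through a string one delimiter at a time. The sample string and variable names below are made up for this sketch.

```ruby
s = "<p>aaa</p><p>bbb</p>"

# Without an offset the search starts at the beginning of the string.
first = s.index("<p>")

# Passing an offset resumes the search just past the previous match.
second = s.index("<p>", first + "<p>".length)

# When nothing is found at or after the offset, index returns nil.
missing = s.index("<p>", second + "<p>".length)

p [first, second, missing]
```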

Then we can write something like this (extending String):

class String
  def delimited_strings(start_delim, end_delim)
    strings = []
    starts_at = index(start_delim) 
    return strings unless starts_at
    ends_at = index(end_delim, starts_at + start_delim.size)
    while starts_at && ends_at do
      strings << self[starts_at+start_delim.size...ends_at]
      # Resume the search after the end delimiter that was just consumed.
      starts_at = index(start_delim, ends_at + end_delim.size)
      ends_at = index(end_delim, starts_at + start_delim.size) if starts_at
    end
    strings
  end
end

s = "<p>aaa<font>xxx</font></p><p>bbb<font>xxx</font></p><p>ccc<font>xxx</font></p>"
s.delimited_strings("<p>", "<font") #=> ["aaa", "bbb", "ccc"]