I have a very complex string, for example:
<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
....
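Now I want to extract the aaa, bbb, and ccc parts. I don't want to use a regular expression here, because turning the <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> part into a regular expression is too complicated.
I'm hoping there is a method (say, substrings_between) that I could use like this: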
substrings = text.substrings_between('<p>', ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>');
substrings # -> [aaa, bbb, ccc]
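Is there such a method? If not, what is the best way to do this?
Answer 0 (score: 4)
Ideally, you should parse HTML with a proper parser, such as Nokogiri.
That said, if you are sure that all you need is the text between two hard-coded strings, you can use scan with a regular expression and let Regexp.escape handle the special characters: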
string = '<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'
before = Regexp.escape '<p>'
after = Regexp.escape ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'
substrings = string.scan(/#{before}(.*?)#{after}/).flatten
=> ["aaa", "bbb", "ccc"]
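The scan approach above can be wrapped into the kind of helper the question asks for. This is only a minimal sketch: the name substrings_between is taken from the question and is not a built-in Ruby method, and the /m flag (so that the captured text may span line breaks) is an assumption added here, not part of the answer above.

```ruby
# Hypothetical helper, named after the method the question wishes for.
# It is not part of Ruby's standard library.
def substrings_between(text, before, after)
  # Regexp.escape makes each delimiter match literally, so characters
  # such as ^, *, ) and [ in the delimiters need no manual escaping.
  # The lazy (.*?) stops at the first occurrence of the closing delimiter;
  # /m (an assumption of this sketch) lets "." match newlines as well.
  pattern = /#{Regexp.escape(before)}(.*?)#{Regexp.escape(after)}/m
  text.scan(pattern).flatten
end

html = '<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
<p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'

result = substrings_between(html, '<p>', ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>')
p result
```

Without the /m flag, "." does not match a newline, so a captured section that spans several lines would be skipped.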
Answer 1 (score: 2)
The following method will do the job:
def substring_between(target, match1, match2)
  start_match1 = target.index(match1)
  if start_match1 && start_match2 = target.index(match2, start_match1 + match1.length)
    start_idx = start_match1 + match1.length
    target[start_idx, start_match2 - start_idx]
  else
    nil
  end
end
If you would rather define it as an instance method on the String class, this should work for you:
class String
  def substring_between(sub1, sub2)
    match1 = self.index(sub1)
    if match1 && match2 = self.index(sub2, match1 + sub1.length)
      idx = match1 + sub1.length
      self[idx, match2 - idx]
    else
      nil
    end
  end
end
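Both implementations return nil if the start or end marker is missing, or if the markers appear in the wrong order. The following test script and its output show that they work as expected: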
strings = [
  'No tags at all',
  '<font End tag before start tag <p>',
  '<p>End tag at end <font',
  'No start tag <font',
  '<p>No end tag',
  '<p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>',
  ' <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>',
  '<p>ccc cccc<font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'
]
strings.each do |s|
  puts "Method Test = #{s} Result: |#{substring_between(s, '<p>', '<font')}|"
  puts "String Test = #{s} Result: |#{s.substring_between('<p>', '<font')}|"
end
Method Test = No tags at all Result: ||
String Test = No tags at all Result: ||
Method Test = <font End tag before start tag <p> Result: ||
String Test = <font End tag before start tag <p> Result: ||
Method Test = <p>End tag at end <font Result: |End tag at end |
String Test = <p>End tag at end <font Result: |End tag at end |
Method Test = No start tag <font Result: ||
String Test = No start tag <font Result: ||
Method Test = <p>No end tag Result: ||
String Test = <p>No end tag Result: ||
Method Test = <p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |aaa |
String Test = <p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |aaa |
Method Test = <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |bbb |
String Test = <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |bbb |
Method Test = <p>ccc cccc<font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |ccc cccc|
String Test = <p>ccc cccc<font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p> Result: |ccc cccc|
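Note that substring_between returns only the first match in a string, whereas the question asks for every occurrence. One possible way to collect them all, sketched here under the assumption that each <p>...</p> element sits on its own line (as in the question's sample), is to apply the method line by line. The variable names text and closing are made up for this illustration.

```ruby
# Same logic as the standalone substring_between method above,
# repeated here so the snippet runs on its own.
def substring_between(target, match1, match2)
  start_match1 = target.index(match1)
  return nil unless start_match1

  start_idx = start_match1 + match1.length
  start_match2 = target.index(match2, start_idx)
  start_match2 && target[start_idx, start_match2 - start_idx]
end

# Single-quoted heredoc, so nothing in the sample is interpolated.
text = <<~'HTML'
  <p>aaa <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
  <p>bbb <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
  <p>ccc <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>
HTML

closing = ' <font style="color:red">ABCD@@@EFG^&*))*T*^[][][]</p>'

# Assumption: one <p>...</p> element per line.
# Lines without both markers yield nil and are dropped by compact.
substrings = text.each_line.map { |line| substring_between(line, '<p>', closing) }.compact
p substrings
```

If several elements can share a line, this line-by-line shortcut no longer applies and a loop that keeps searching from the end of the previous match is needed instead.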
Answer 2 (score: 1)
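# Note: strip_tags is a Rails view helper (ActionView::Helpers::SanitizeHelper),
# not a method of plain Ruby.
string = '<span id="span_is"><br><br><u><i>Hi</i></u></span>'
strip_tags(string) # returns 'Hi'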
Answer 3 (score: 1)
I think you have to build this functionality yourself. Something like:
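def substrings_between(str, opening, ending)
  res = []
  i_opening = str.index(opening)
  # Look for the ending delimiter only after the opening one.
  i_ending = i_opening && str.index(ending, i_opening + opening.length)
  while i_opening && i_ending
    # Exclusive range (...), so the first character of the ending delimiter is left out.
    res << str[i_opening + opening.length...i_ending]
    # Continue with the rest of the string after the ending delimiter.
    str = str[i_ending + ending.length..-1]
    i_opening = str.index(opening)
    i_ending = i_opening && str.index(ending, i_opening + opening.length)
  end
  res
end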
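(This code is not very idiomatic Ruby, but it does the job.)
Answer 4 (score: 1)
I think the function you are looking for may be too specific to be included in the Ruby distribution.
We can assemble it ourselves using String#index(string, offset).
Then we can write something like this (extending String):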
class String
  def delimited_strings(start_delim, end_delim)
    strings = []
    starts_at = index(start_delim)
    return strings unless starts_at
    ends_at = index(end_delim, starts_at + start_delim.size)
    while starts_at && ends_at do
      strings << self[starts_at + start_delim.size...ends_at]
      # Resume the search after the end delimiter that was just matched.
      starts_at = index(start_delim, ends_at + end_delim.size)
      ends_at = index(end_delim, starts_at + start_delim.size) if starts_at
    end
    strings
  end
end
s = "<p>aaa<font>xxx</font></p><p>bbb<font>xxx</font></p><p>ccc<font>xxx</font></p>"
s.delimited_strings("<p>", "<font") #=> ["aaa", "bbb", "ccc"]