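I'm trying to write a Ruby script that takes the Flickr BBCode for an image and extracts only the actual image link, ignoring everything else.
The BBCode from Flickr looks like this: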
<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>
I want my output to be just the link, i.e.: https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg
So far, my code is:
#!/usr/bin/ruby
require 'rubygems'
str1 = ""
puts "What text would you like me to use? "
text = gets
text.scan(/"([^"]*)"/) { str1 = $1}
puts str1
I need to know how to scan the input and find only the part that starts with https and ends at the closing quote. Any help is appreciated.
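For the literal request (text that starts with https and ends at the closing quote), a regex-only sketch could look like the following. It assumes the image URL is the only double-quoted value beginning with https://, which holds for the sample above because the href uses plain http. The helper name `https_links` is hypothetical.

```ruby
# Hypothetical helper (not part of the original script): returns every
# double-quoted value in the text that starts with https://.
def https_links(text)
  # The capture group runs from https:// up to, but not including, the closing quote.
  text.scan(/"(https:\/\/[^"]+)"/).flatten
end

bbcode = '<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>'

# Prints one matching link per line.
puts https_links(bbcode)
```

This breaks as soon as the markup contains another quoted https value (for example if Flickr switches the href to https), so it is only a stopgap.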
Answer 0 (score: 2)
Use an HTML parser instead. Something like Nokogiri (http://nokogiri.org/):
require 'nokogiri'
doc = Nokogiri::HTML.parse '<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>'
doc.css('a').each do |link|
puts link.attr(:href)
end
Answer 1 (score: 1)
If you are trying to parse HTML, then you should use a proper HTML parser.
For example, this is trivial in Nokogiri:
require 'nokogiri'
bbcode = %Q[<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>]
Nokogiri::HTML(bbcode).css('a')[0]['href']
# => "http://www.flickr.com/photos/user/9049969465/"
You will obviously have to add some error checking there, but that's the basics.
Answer 2 (score: 0)
require 'nokogiri'
doc = Nokogiri::HTML(<<-eol)
<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>
eol
doc.at_css("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"
doc.at("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"