How to find certain text between quotes

Date: 2013-06-19 19:36:52

Tags: ruby flickr

I'm trying to write a Ruby script that takes the Flickr BBCode for an image, finds only the actual image link, and ignores everything else.

The BBCode from Flickr looks like this:

<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>

I want my output to be just the link, so: https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg

My code so far is:

#!/usr/bin/ruby

require 'rubygems'

str1 = ""

puts "What text would you like me to use? "
text = gets

text.scan(/"([^"]*)"/) { str1 = $1}

puts str1

I need to know how to scan the input and find only the part that starts with https and ends at a quotation mark. Any help is appreciated.

3 answers:

Answer 0: (score: 2)

Don't try to parse HTML with a regex.

Instead, use an HTML parser, something like Nokogiri (http://nokogiri.org/):
require 'nokogiri'
doc = Nokogiri::HTML.parse '<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>'

doc.css('a').each do |link|
  puts link.attr(:href)
end

Answer 1: (score: 1)

If you're trying to parse HTML, you should use a proper HTML parser.

For example, this is trivial in Nokogiri:
require 'nokogiri'

bbcode = %Q[<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>]

Nokogiri::HTML(bbcode).css('a')[0]['href']
# => "http://www.flickr.com/photos/user/9049969465/"

You'll obviously have to add some error checking in there, but that's the basics.

Answer 2: (score: 0)

require 'nokogiri'

doc = Nokogiri::HTML(<<-eol)
<a href="http://www.flickr.com/photos/user/9049969465/" title="Wiggle Wiggle by Anonymous, on Flickr"><img src="https://farm3.staticflickr.com/2864/92917419471_248187_c.jpg" width="800" height="526" alt="Wiggle Wiggle"></a>
eol
doc.at_css("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"
doc.at("a")['href']
# => "http://www.flickr.com/photos/user/9049969465/"