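I'm trying to build an array of all the image files on a Google Images results page.
I want a regular expression that extracts everything after "imgurl=" and stops before the next "&", as in this HTML: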
<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg&imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&h=400&w=400&sz=58&hl=en&start=19&zoom=1&tbnid=ajDcsGGs0tgE9M:&tbnh=124&tbnw=124&ei=qagfUbXmHKfv0QHI3oG4CQ&itbs=1&sa=X&ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I could do this with a regular expression, but I can't find a way to search my parsed document with one.
Answer 0 (score: 2)
str = '<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg&imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&h=400&w=400&sz=58&hl=en&start=19&zoom=1&tbnid=ajDcsGGs0tgE9M:&tbnh=124&tbnw=124&ei=qagfUbXmHKfv0QHI3oG4CQ&itbs=1&sa=X&ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&')[0]
#=> "http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is this what you're looking for?
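The split above returns only the first match. Since the question asks for an array of every image, here is a minimal sketch using `String#scan` on a whole page. The two-link `html` string and its example.com URLs are made up for illustration.

```ruby
# Hypothetical page fragment with two result links (not real Google output).
html = '<a href="http://www.google.com/imgres?imgurl=http://www.example.com/a.jpg&h=1">a</a>' \
       '<a href="http://www.google.com/imgres?imgurl=http://www.example.com/b.jpg&h=2">b</a>'

# Capture everything after "imgurl=" up to the next '&' or closing quote.
# scan returns one array per match when the regex has a group, so flatten it.
img_urls = html.scan(/imgurl=([^&"]+)/).flatten

p img_urls
```

This works on the raw string, so it also picks up any `imgurl=` text outside of links.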
Answer 1 (score: 2)
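The problem with using a regular expression is that it assumes too much about the order of the parameters in the URL. If the order changes, or the & disappears, the regex will stop working.
Instead, parse the URL, then split out the values: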
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'
doc = Nokogiri::HTML.parse('<a href="http://www.google.com/imgres?imgurl=http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg&imgrefurl=http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html&usg=__YJdf3xc4ydSfLQa9tYnAzavKHYQ=&h=400&w=400&sz=58&hl=en&start=19&zoom=1&tbnid=ajDcsGGs0tgE9M:&tbnh=124&tbnw=124&ei=qagfUbXmHKfv0QHI3oG4CQ&itbs=1&sa=X&ved=0CE4QrQMwEg"><img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"></a><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')
# Walk every link, parse its query string, and print the imgurl parameter
doc.search('a').each do |a|
  query_params = CGI.parse(URI(a['href']).query)
  puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
URI and CGI are used together because URI's decode_www_form raises an exception when trying to decode the query.
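The point about parameter order can be sketched as follows. The `href` is hypothetical, with `imgurl` moved to the end so that no `&` follows it. The sketch swaps in `URI.decode_www_form`, which accepts a simple query like this one on current Ruby versions. `CGI.parse` from the code above would do the same job.

```ruby
require 'uri'

# Hypothetical href: imgurl is the last parameter, so nothing follows it.
href = 'http://www.google.com/imgres?h=400&w=400&imgurl=http://www.example.com/a.jpg'

# The regex needs a trailing '&' to terminate the match, so it finds nothing.
regex_result = href[/imgurl=(.*?)&/, 1]

# Parsing the query does not care where the parameter appears.
parsed_result = URI.decode_www_form(URI(href).query).to_h['imgurl']

p regex_result   # nil
p parsed_result  # "http://www.example.com/a.jpg"
```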
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
Which would return:
{"imgurl"=> "http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg", "imgrefurl"=> "http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html", "usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ", "h"=>"400", "w"=>"400", "sz"=>"58", "hl"=>"en", "start"=>"19", "zoom"=>"1", "tbnid"=>"ajDcsGGs0tgE9M:", "tbnh"=>"124", "tbnw"=>"124", "ei"=>"qagfUbXmHKfv0QHI3oG4CQ", "itbs"=>"1", "sa"=>"X", "ved"=>"0CE4QrQMwEg"}
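One detail in the hash above is that the `usg` value has lost its trailing `=`. A bare `split('=')` drops trailing empty fields. Here is a small sketch of the difference when a limit of 2 is passed to `split`, using a shortened, made-up query string.

```ruby
# Hypothetical query string; the 'usg' value ends in '=' like the real one.
query = 'imgurl=http://www.example.com/a.jpg&usg=__abc=&h=400'

# Without a limit, the trailing '=' of the value is discarded.
lossy = Hash[query.split('&').map { |pair| pair.split('=') }]

# With a limit of 2, only the first '=' separates key from value.
exact = Hash[query.split('&').map { |pair| pair.split('=', 2) }]

p lossy['usg']  # "__abc"
p exact['usg']  # "__abc="
```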
Answer 2 (score: 1)
To get all of the images you want:
require 'nokogiri'
require 'open-uri'

# get all links that have an href
url = 'some-google-images-url'
links = Nokogiri::HTML(URI.open(url)).css('a[href]')
# get regex match or nil on desired img
img_urls = links.map { |a| a['href'][/imgurl=(.*?)&/, 1] }
# get rid of nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &. Otherwise the greedy .* would take everything up to the last & in the string.
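A quick way to see the difference between the two quantifiers, using a shortened hypothetical href:

```ruby
# Hypothetical href with several parameters after imgurl.
href = 'http://www.google.com/imgres?imgurl=http://www.example.com/a.jpg&h=400&w=400&sa=X'

# Non-greedy: stops at the first '&' after imgurl=.
lazy = href[/imgurl=(.*?)&/, 1]

# Greedy: runs to the end of the string, then backs up to the last '&'.
greedy = href[/imgurl=(.*)&/, 1]

p lazy    # "http://www.example.com/a.jpg"
p greedy  # "http://www.example.com/a.jpg&h=400&w=400"
```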