Question

I have some scraping code that extracts image URLs and image names (the names are inside the <a> tags). The code I wrote looks like this:

require 'nokogiri'
require 'open-uri'

BASE = 'http://antwrp.gsfc.nasa.gov/apod/'

# URI.open comes from open-uri; a bare Kernel#open no longer accepts URLs on Ruby 3.0+
f = URI.open('http://antwrp.gsfc.nasa.gov/apod/archivepix.html')
html_doc = Nokogiri::HTML(f.read)
html_doc.xpath('//b//a')[0..10].each do |element|
  imgurl = BASE + element.attributes['href'].value
  imgname = element.attributes['innerText']
  puts imgname
  doc = Nokogiri::HTML(URI.open(imgurl).read)
  doc.xpath('//p//a//img').each do |elem|
    small_img = BASE + elem.attributes['src'].value
    puts small_img
  end
end

When I run the program, I get this output:

http://antwrp.gsfc.nasa.gov/apod/image/1308/twolines_yen_960.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/perseids_vangaal_960.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/phas_jpl_960.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/m74_hubble_960.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/tafreshiIMG_4098Trail-s900.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/Albrechtsberg_Perseid2012-08-12_voltmer900.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/ngc3370_hst_900.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/auroraemeteors_boardman_1770.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/cone_noajgendler_960.jpg

http://antwrp.gsfc.nasa.gov/apod/image/1308/ioplus_galileo_960.jpg

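The blank lines between the links are where I want the image names to appear (for example, "Moonset from Taiwan" for the first image). I have a feeling that I can't get the name because it is a child node and I'm not accessing it. Does anyone know how I should change the imgname variable so that it returns the image name?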

Answer 1

How about this?

html_doc.xpath('//b//a')[0..10].each do |element|
  imgurl = BASE + element.attributes['href'].value
  #imgname = element.attributes['innerText']
  imgname = element.content
  puts imgname
  ...
end

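element.text or element.inner_text should give the same output in your case. The original line printed nothing because innerText is not an attribute of the <a> element, so element.attributes['innerText'] returns nil. The link text lives in the element's child text node, which is what element.content returns.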
