Question

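What I'm trying to do: parse all links from the site (http://nytm.org/made-in-nyc) whose text is exactly "(hiring)", then write the list of links to a file, 'jobs.html'. (If posting these sites is a violation, I'll quickly remove the direct URLs. I thought they might help explain what I'm trying to do. This is my first time posting online.)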

DOM structure:

<article>
<ol>
<li><a href="http://www.waywire.com" target="_self" class="vt-p">#waywire</a></li>
<li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
<li><a href="http://www.adafruit.com/" target="_self" class="vt-p">Adafruit Industries</a></li>
<li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>

...etc.

What I tried:

require 'nokogiri'
require 'open-uri'

def find_jobs
   doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
   hire_links = doc.css("a").select{|link| link.text == "(hiring)"}
   results = hire_links.each{|link| puts link['href']}

 begin
   file = File.open("./jobs.html", "w")
   file.write("#{results}") 
 rescue IOError => e
 ensure
   file.close unless file == nil
 end

puts hire_links
end

find_jobs

Here is the Gist.

Sample result: [344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>

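So it does write the entries to the jobs.html file, but they come out as inspected XML objects? I'm not sure how to target only the href values and create links from them, and I don't know where to start. Thanks!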

Answer 1

Try using Mechanize. It is built on Nokogiri, and you can do something like:

require 'mechanize'

browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
# Escape the parentheses so the pattern matches the literal text "(hiring)"
links = page.links_with(text: /\(hiring\)/)

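You will then have an array of link objects from which you can get whatever information you need. You can also use the link.click method that Mechanize provides.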
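One detail worth checking in the pattern: in a Ruby regular expression, unescaped parentheses form a capture group, so `/(hiring)/` matches any text containing "hiring", whereas `/\(hiring\)/` matches only the literal "(hiring)". A minimal sketch in plain Ruby, independent of Mechanize (the sample strings are made up for illustration):

```ruby
# Hypothetical link texts, for illustration only.
texts = ['(hiring)', 'We are hiring', 'Adafruit Industries']

loose  = /(hiring)/    # parentheses act as a capture group
strict = /\(hiring\)/  # escaped parentheses match literal "(" and ")"

loose_matches  = texts.grep(loose)
strict_matches = texts.grep(strict)

p loose_matches   #=> ["(hiring)", "We are hiring"]
p strict_matches  #=> ["(hiring)"]
```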

Answer 2

The problem is in how results is defined. results is an array of Nokogiri::XML::Element objects:

results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element

When you write a Nokogiri::XML::Element to a file, you get the result of inspecting it:

puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."

Since you want the href attribute of each link, you should collect that in results instead:

results = hire_links.map{ |link| link['href'] }
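The difference comes down to return values: `each` returns the array it was called on, while `map` returns a new array built from the block's results. A minimal sketch, using plain hashes as stand-ins for Nokogiri elements (an assumption made so the snippet runs without any gems; both respond to `link['href']`):

```ruby
# Hashes stand in for Nokogiri elements here; both respond to link['href'].
hire_links = [
  { 'href' => 'http://www.adafruit.com/jobs/' },
  { 'href' => 'http://www.zocdoc.com/careers' }
]

with_each = hire_links.each { |link| link['href'] }
with_map  = hire_links.map  { |link| link['href'] }

p with_each.equal?(hire_links)  #=> true (each returns the original array)
p with_map                      #=> ["http://www.adafruit.com/jobs/", "http://www.zocdoc.com/careers"]
```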

Assuming you want each href to appear on its own line in the file, you can join the array:

File.write('./jobs.html', results.join("\n"))
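If jobs.html is meant to contain clickable links rather than bare URLs, each href could be wrapped in an anchor tag before writing. A rough sketch, assuming `results` is an array of URL strings; the `links_to_html` helper name and the ordered-list layout are illustrative choices, not part of the original script:

```ruby
require 'erb'

# Hypothetical helper: turns an array of URLs into a simple HTML list.
def links_to_html(urls)
  items = urls.map do |url|
    safe = ERB::Util.html_escape(url)  # escape &, <, >, and quotes
    %(<li><a href="#{safe}">#{safe}</a></li>)
  end
  "<ol>\n#{items.join("\n")}\n</ol>\n"
end

# Sample hrefs; in the real script these come from hire_links.map.
results = ['http://www.20x200.com/jobs/', 'http://www.8coupons.com/home/jobs']
html = links_to_html(results)
File.write('./jobs.html', html)
```

Escaping the URL matters mainly for hrefs that contain `&` or quotes, which would otherwise produce invalid markup.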

The modified script:

require 'nokogiri'
require 'open-uri'

def find_jobs
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
  doc = Nokogiri::HTML(URI.open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)"}
  results = hire_links.map { |link| link['href'] }       
  File.write('./jobs.html', results.join("\n"))
end

find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html   
#=> ...

Ruby: How do I parse links with Nokogiri by content/text?

2 answers: