Find all links in ten URLs while reading a file

Asked: 2015-10-21 18:59:34

Tags: ruby nokogiri

How can I extract the href attribute of every <a> tag on each page while reading the page URLs from a file?

Suppose I have a text file containing the target URLs:

http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html

Here is my code:

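require 'nokogiri'
require 'open-uri'

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    # fetch and parse the page at the URL on the current line
    page = Nokogiri::HTML(URI.open(line.chomp))
    links = page.css("a")
    # prints only the href of the first <a> tag
    puts links[0]["href"]
  end
end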

2 Answers:

Answer 0 (score: 2)

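I would flip it around. I'd first parse the text file and load every line into memory (assuming the data set is small enough). Then create a Nokogiri instance for your HTML document and extract all the href attributes (as you are already doing).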

Something like this untested code:

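require 'nokogiri'

links = []
hrefs = []

# Load the target URLs, stripping the trailing newline from each line
File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    links << line.chomp
  end
end

# `html` is assumed to hold the HTML source of the page to search
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
  hrefs << tag['href']
end

# Report each target URL that the page links to
links.each do |link|
  if hrefs.include?(link)
    puts "it's here"
  end
end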
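
The comparison step is easy to get subtly wrong, because lines read from a file keep their trailing newline and so never equal an href unless they are stripped first. Here is a minimal sketch of just that step, using only the standard library. The helper names `load_urls` and `linked_urls` are made up for illustration, and `hrefs` is assumed to be a plain array of strings (or nil) that was already pulled out of the page.

```ruby
require 'set'

# Read one URL per line, dropping the trailing newline and any blank lines.
# `load_urls` is a hypothetical helper name, not part of Nokogiri.
def load_urls(path)
  File.foreach(path).map(&:strip).reject(&:empty?)
end

# Return the target URLs that also appear among the extracted hrefs.
# `hrefs` is assumed to be an array of strings, with nil for <a> tags
# that have no href attribute.
def linked_urls(urls, hrefs)
  seen = hrefs.compact.map(&:strip).to_set
  urls.select { |url| seen.include?(url) }
end
```

With Nokogiri in place, this could be called as `linked_urls(load_urls("myfile.txt"), page.css("a").map { |tag| tag["href"] })`. Using a Set keeps each lookup cheap if the page has many links, though for a handful of URLs a plain array would do just as well.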

Answer 1 (score: 0)

If all I wanted to do was output the 'href' for each <a>, I'd write something like:

require 'nokogiri'
require 'open-uri'

File.foreach('myfile.txt') do |url|
  page = Nokogiri::HTML(URI.open(url.chomp))
  puts page.search('a').map { |link| link['href'] }
end

Of course, <a> tags don't have to have an 'href', but puts won't care.
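
If the missing hrefs do matter, for example because the links are going to be followed rather than printed, a small clean-up pass may help. The sketch below is only one possible approach and uses the standard library alone: `absolute_hrefs` is a made-up helper name, and it assumes `hrefs` is the array of raw attribute values (nil where an <a> has no href) and `page_url` is the URL the page was fetched from.

```ruby
require 'uri'

# Drop missing or blank hrefs and resolve the rest against the page URL,
# so relative links such as "2.html" or "/about" become absolute.
# `absolute_hrefs` is a hypothetical helper name used for illustration.
def absolute_hrefs(page_url, hrefs)
  hrefs.compact.map(&:strip).reject(&:empty?).filter_map do |href|
    begin
      URI.join(page_url, href).to_s
    rescue URI::Error
      nil # skip hrefs that cannot be parsed as URIs
    end
  end
end
```

Inside the loop above it might be called as `puts absolute_hrefs(url.chomp, page.search('a').map { |link| link['href'] })`. Note that `filter_map` needs Ruby 2.7 or later.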