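Parsing index.html with Nokogiri and pairing each a link with the text that follows it

Time: 2017-10-01 08:17:37

Tags: ruby nokogiri

Please help me figure out how to pair each build name with its date correctly, and then sort all links in ascending order by upload date.

A sample index.html looks like this: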

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head><title>Index of localhost/BUILD</title>
</head>
<body>
<h1>Index of localhost/BUILD</h1>
<pre>Name             Last modified      Size</pre><hr/>
<pre><a href="../">../</a>
<a href="BUILD.10.tar">BUILD.10.tar</a>   27-Sep-2017 15:46  250 bytes
<a href="BUILD.13.tar">BUILD.13.tar</a>   28-Sep-2017 12:14  254 bytes
<a href="BUILD.15.tar">BUILD.15.tar</a>   29-Sep-2017 08:56  257 bytes
<a href="BUILD.16.tar">BUILD.16.tar</a>   29-Sep-2017 08:56  258 bytes
<a href="BUILD.17.tar">BUILD.17.tar</a>   29-Sep-2017 08:56  256 bytes
<a href="BUILD.9.tar">BUILD.9.tar</a>    27-Sep-2017 15:44  247 bytes
</pre>
<hr/><address style="font-size:small;">Artifactory/5.2.1 Server</address></body></html>

My script currently looks like this:

require 'open-uri'
require 'nokogiri'

  build_url = "/home/index.html"
  index_html = open(build_url).read
  index_dom = Nokogiri::HTML.parse index_html

  builds =[]
  links = index_dom.css('a').each { |link|
    build = link.text
    if build.end_with?(".tar")
      builds.push(build)
    end
  }
  rc_builds = []
  builds.sort.each { |b|  rc_builds << b }
  p rc_builds

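It needs to be changed to capture both the build name and the Last modified value, and to output the rc_builds array sorted in ascending order by Last modified.

I cannot make any changes to index.html, so the solution has to work with the index.html page shown in the sample.

The problem is that I cannot figure out how to access the Last modified text.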

2 answers:

Answer 0: (score: 1)

You can try getting the anchor tags together with the text next to them.

require 'time'

index_dom = Nokogiri::HTML.parse(html)

# Walk each pre tag in the parsed HTML
builds = index_dom.css('pre').flat_map do |pre|
  # Collect every "Last modified" timestamp in this pre tag, in document order
  dates = pre.text.scan(/\d+-\w+-\d{4} \d{2}:\d{2}/)
  # Pair each anchor in this pre tag with its timestamp
  pre.css('a').map.with_index do |anchor, index|
    href = anchor['href']
    # The leading "../" anchor has no date, so anchor N lines up with dates[N - 1]
    [dates[index - 1], href] if href.end_with?('.tar')
  end.compact
  # compact drops the nil produced for the "../" anchor
end

# Sort by the parsed timestamp; comparing the raw strings would order by day of month first
p builds.sort_by { |date, _| Time.strptime(date, '%d-%b-%Y %H:%M') }
# => [["27-Sep-2017 15:44", "BUILD.9.tar"], ["27-Sep-2017 15:46", "BUILD.10.tar"], ["28-Sep-2017 12:14", "BUILD.13.tar"], ...]

Answer 1: (score: 1)

This is how I would do it:

dom = Nokogiri::HTML.parse index_html

# The second pre tag holds the file listing; the first one is the column header
build_info = dom.css('pre')[1].text

result = []

build_info.split("\n").each do |line|
  # Skip the "../" entry and blank lines
  next unless line =~ /BUILD/
  # Columns: name, date, time, size, unit
  arr = line.split(/\s+/)
  result.push({
    build: arr[0],
    modified: "#{arr[1]} #{arr[2]}",
    size: arr[3],
    size_unit: arr[4]
  })
end

p result

#[{:build=>"BUILD.10.tar", :modified=>"27-Sep-2017 15:46", :size=>"250", :size_unit=>"bytes"}, {:build=>"BUILD.13.tar", :modified=>"28-Sep-2017 12:14", :size=>"254", :size_unit=>"bytes"}, {:build=>"BUILD.15.tar", :modified=>"29-Sep-2017 08:56", :size=>"257", :size_unit=>"bytes"}, {:build=>"BUILD.16.tar", :modified=>"29-Sep-2017 08:56", :size=>"258", :size_unit=>"bytes"}, {:build=>"BUILD.17.tar", :modified=>"29-Sep-2017 08:56", :size=>"256", :size_unit=>"bytes"}, {:build=>"BUILD.9.tar", :modified=>"27-Sep-2017 15:44", :size=>"247", :size_unit=>"bytes"}]
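
The hashes above are still in listing order. To get the ascending Last modified order the question asks for, one option is to parse modified into a Time before sorting. This is a sketch over a shortened copy of the result array. The BUILD.1.tar entry is hypothetical (it is not in the sample) and is there only to show a date that a plain string sort would misplace. The tie-break on the numeric build number is likewise an assumption for builds that share a timestamp.

```ruby
require 'time'

# Shortened copy of the parsed entries; only the keys needed for sorting are kept.
result = [
  { build: 'BUILD.10.tar', modified: '27-Sep-2017 15:46' },
  { build: 'BUILD.17.tar', modified: '29-Sep-2017 08:56' },
  { build: 'BUILD.15.tar', modified: '29-Sep-2017 08:56' },
  { build: 'BUILD.9.tar',  modified: '27-Sep-2017 15:44' },
  # Hypothetical entry, not part of the sample listing.
  { build: 'BUILD.1.tar',  modified: '01-Oct-2017 09:00' }
]

sorted = result.sort_by do |entry|
  # Primary key: the parsed timestamp. Secondary key: the build number as an integer.
  [Time.strptime(entry[:modified], '%d-%b-%Y %H:%M'), entry[:build][/\d+/].to_i]
end

rc_builds = sorted.map { |entry| entry[:build] }
p rc_builds
```

Comparing the raw strings would put "01-Oct-2017" ahead of "27-Sep-2017" because the day of month comes first, which is why the timestamp is parsed before comparison.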