Workaround for pages that define a single link with multiple XPath selectors?

Asked: 2013-11-30 23:23:10

Tags: ruby xpath nokogiri

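The code below works, but it does not iterate to the next page. I have found that the site in question uses two different XPath selectors to define the next-page link, and I am not sure how to handle that in code.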

In response to a comment, here is the source of the pager in question on the first page:

<table class="pager" cellspacing="0">
    <tr>
        <td>
                    Items 1 to 72 of 1146 total                </td>
                <td class="pages">
            <strong>Page:</strong>
            <ol>
                                                            <li><span class="on">1</span></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=2">2</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=3">3</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=4">4</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=5">5</a></li>
                                                        <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=2"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_right.gif" alt="Next Page"/></a></li>
                        </ol>
        </td>

        <td class="a-right">
            Show <select onchange="setLocation(this.value)">
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=12&amp;order=position">
                    12                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=24&amp;order=position">
                    24                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=48&amp;order=position">
                    48                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position" selected="selected">
                    72                </option>
                        </select> per page        </td>

    </tr>
</table>

And the same pager on all subsequent pages:

<table class="pager" cellspacing="0">
    <tr>
        <td>
                    Items 73 to 144 of 1146 total                </td>
                <td class="pages">
            <strong>Page:</strong>
            <ol>
                            <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=1"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_left.gif" alt="Previous Page" /></a></li>
                                                            <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=1">1</a></li>
                                                                <li><span class="on">2</span></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=3">3</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=4">4</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=5">5</a></li>
                                                        <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=3"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_right.gif" alt="Next Page"/></a></li>
                        </ol>
        </td>

        <td class="a-right">
            Show <select onchange="setLocation(this.value)">
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=12&amp;order=position&amp;p=2">
                    12                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=24&amp;order=position&amp;p=2">
                    24                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=48&amp;order=position&amp;p=2">
                    48                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=2" selected="selected">
                    72                </option>
                        </select> per page        </td>

    </tr>
</table>

On the first page of results, the next-page link is defined by the XPath selector:

//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[6]/‌​a

On all subsequent pages, the next-page link is defined by:

//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/‌​a

Which part of the code should I change, and how, so that the program iterates to the next page of results regardless of how next_page_link is defined?

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'fileutils'

DATA_DIR = "data-hold/clothing-accessories"
# mkdir_p creates the nested path and is a no-op if it already exists
FileUtils.mkdir_p(DATA_DIR)
BASE_TOM_URL = "http://www.example.com"

list_url = "#{ BASE_TOM_URL }/clothing-accessories?dir=asc&limit=72&order=position"

loop do

  page = Nokogiri::HTML(URI.open(list_url))
  rows = page.xpath('//*[@id="product-list-table"]/li')

  unless rows.empty?

    rows[1..-2].each do |row|

      hrefs = row.xpath('//*[@id="product-list-table"]/li/div/a').map{ |a| a['href'] }.uniq

      hrefs.each do |href|

        remote_url = href
        local_fname = "#{ DATA_DIR }/#{ File.basename(href) }"

        unless File.exist?(local_fname)

          puts "Fetching #{ remote_url }..."

          begin
            tom_content = URI.open(remote_url).read
            File.write(local_fname, tom_content)
            puts "\t...Success, saved to #{ local_fname }"
            sleep 1.0 + rand
          rescue StandardError => e
            puts "Error: #{ e }"
            sleep 5
          end  

        end 

      end 

    end

  end


  next_results_link = page.at('//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/a')

  if next_results_link
    list_url = next_results_link['href']
    puts "\t...Getting next page of results: #{list_url}"
  else
    break
  end

end

2 Answers:

Answer 0 (score: 0)

Why don't you do something like this:

rows[1..-2].each_with_index do |row, i|

  ...

  xpath_index = if i == 1
    '6'
  else
    '7'
  end

  next_results_link = page.at(%Q!//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[#{ xpath_index }]/a!)
  ...

end

This should give you an idea of what it is doing:

xpath_index = 6
%Q!//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[#{ xpath_index }]/a!
# => "//*[@id=\"bodyblock\"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[6]/a"

xpath_index = 7
%Q!//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[#{ xpath_index }]/a!
# => "//*[@id=\"bodyblock\"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/a"
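
One way to apply the interpolation idea is to key the index off the page number rather than the row index. The following is only a sketch: next_link_xpath and PAGER_PREFIX are hypothetical names, and it assumes the pager always lists exactly five page links, as in the markup shown in the question.

```ruby
# Hypothetical constant: the common prefix of the two XPaths from the question.
PAGER_PREFIX = '//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol'

# Hypothetical helper: builds the XPath of the "Next Page" link for a page.
def next_link_xpath(page_number)
  # Page 1 has no "Previous Page" item, so the arrow is the 6th <li>;
  # on later pages the extra item pushes it to the 7th.
  li_index = page_number == 1 ? 6 : 7
  "#{PAGER_PREFIX}/li[#{li_index}]/a"
end

puts next_link_xpath(1)
puts next_link_xpath(2)
```

Positional indexes like these remain fragile: they break as soon as the pager shows a different number of items, for example near the last page.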

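Also, just so you know, you have non-ASCII characters in your XPath. I don't know how they got there, but the trailing /a is not valid. Currently it is: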

'/‌​a'.codepoints.to_a # => [47, 8204, 8203, 97]

It should be:

'/a'.codepoints.to_a # => [47, 97]
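
If the invisible characters came along with a copy and paste, they can be stripped before the string is used as a selector. This is a minimal sketch that assumes the only stray characters are the two zero-width code points shown above (U+200C and U+200B); the variable names are illustrative.

```ruby
# Build the damaged string with escapes so the invisible characters
# are explicit in the source.
damaged = "/\u200C\u200Ba"

# Delete U+200B (ZERO WIDTH SPACE) and U+200C (ZERO WIDTH NON-JOINER).
cleaned = damaged.delete("\u200B\u200C")

p damaged.codepoints # => [47, 8204, 8203, 97]
p cleaned.codepoints # => [47, 97]
```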

  

"The page.at(%Q!" selector syntax is new to me; I haven't seen it in anything I've read.

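at is Nokogiri's equivalent of search(some_node_selector, some_name_space).first. It is documented under Nokogiri::XML::Node#at. In other words, it finds only the first matching node and returns it, whereas search finds all matching nodes and returns them as a NodeSet.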

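at accepts CSS and XPath selectors alike. The CSS-specific version is at_css and the XPath-specific version is at_xpath. I tend to use at unless I'm using an ambiguous selector that would fool Nokogiri into doing the wrong thing.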

Similarly, search accepts both CSS and XPath, while css and xpath are the CSS-specific and XPath-specific variants.

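%Q!...! is an alternate way of defining an interpolated, double-quoted string. Besides %Q there are %q and bare %, %w and %W for arrays of words, %r for regular expressions, %x to run a command-line application in a sub-shell, %s for symbols, and %i for arrays of symbols, which was added in Ruby 2.0.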

Here are some examples:

foo = 'bar'

%Q[a b]        # => "a b"
%Q^a #{ foo }^ # => "a bar"

%[a b]        # => "a b"
%/a #{ foo }/ # => "a bar"

%q#a b#        # => "a b"
%q[a #{ foo }] # => "a \#{ foo }"

%w$a b$ # => ["a", "b"]
%W~a b~ # => ["a", "b"]

%W[a foo]      # => ["a", "foo"]
%W[a #{ foo }] # => ["a", "bar"]

%r.^foo. # => /^foo/
%r!^foo! # => /^foo/
%r/^foo/ # => /^foo/
%x(date) # => "Mon Dec  2 21:13:37 MST 2013\n"

%s[a]   # => :a
%s[a b] # => :"a b"
%i[a b] # => [:a, :b]

Note that the delimiters can be bookends, such as () or [], or the same character repeated, such as # or !. This gives a lot of flexibility when dealing with strings that contain single and double quotes, and can clean up lines suffering from "leaning toothpick syndrome":

"He's quoting Shakespeare's \"The Taming of the Shrew\"" # => "He's quoting Shakespeare's \"The Taming of the Shrew\""
'He\'s quoting Shakespeare\'s "The Taming of the Shrew"' # => "He's quoting Shakespeare's \"The Taming of the Shrew\""
%Q[He's quoting Shakespeare's "The Taming of the Shrew"] # => "He's quoting Shakespeare's \"The Taming of the Shrew\""

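Note how the last one is visually cleaner and easier to type. These are just simple examples with embedded single and double quotes. Read Wikipedia's article on "Leaning Toothpick Syndrome" for more examples and information.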

Answer 1 (score: 0)

The link contains an image with the alt text "Next Page". Make use of that:

//td[contains(@class, 'pages')]/ol/li/a[img/@alt='Next Page']

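If you prefer the full path, you can easily apply the predicate from this XPath expression to the path you extracted above. I would go even further and use //td[contains(@class, 'pages')]//a[img/@alt='Next Page'] to decouple the code from the XML structure even more.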

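For matching the class attribute, you should also consider using a more correct version, although it makes the expression somewhat more complex. Have a look at this question on matching XML classes.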