Getting a table with Mechanize in Ruby

Date: 2014-12-09 02:37:52

Tags: ruby web-scraping html-table nokogiri mechanize

I want to get the items from this table:

<table style="margin: auto;width: 800px" id="myTable" class="tablesorter">
    <thead>
        <tr class="TableHeader">
            <th >Game</th><th>Icon</th><th>Achievement</th>
                                <th>Achievers</th>
                    <th>Value</th>
                        </tr>
    </thead>
    <tbody>
            <tr>
                        <td><a href="Steam_Game_Info.php?AppID=440"><img alt="Logo" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/07385eb55b5ba974aebbe74d3c99626bda7920b8.jpg" width=133 height=50 ></a></td>
                        <td>    <table>
        <tr>
            <td class="AchievementBox" style="background-color: #347C17">
                <a href="Steam_Achievement_Info.php?AchievementID=169&amp;AppID=440">                <img  alt="Icon" src="http://cdn.akamai.steamstatic.com/steamcommunity/public/images/apps/440/924764eea604817d3c14de9640ae6422c7cdfb7a.jpg" height='50' width='50'>
                </a>            </td>
        </tr>
    </table>
</td>
            <td style="text-align: left" ><a href="Steam_Achievement_Info.php?AchievementID=169&amp;AppID=440">Race for the Pennant</a><br>Run 25 kilometers.</td>
            <td style="text-align: right">35505</td><td style="text-align: right">1.3</td>

The table's ID is myTable, so what I want to do is:

go inside <tbody>
for each <tr> in table:
    do something; maybe go inside <td> or get a link from <href>

I have:

require 'mechanize'

agent = Mechanize.new
page = agent.get("http://astats.astats.nl/astats/TopListAchievements.php?DisplayType=2")

puts page.body

This prints the page, but how do I actually iterate over the table rows?

1 answer:

Answer 0: (score: 2)

Use a CSS selector to print the text and the href attribute value of each achievement link:

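require 'nokogiri'

# `page` is the Mechanize page fetched in the question.
doc = Nokogiri::HTML(page.body)
# Nokogiri accepts the non-standard td[3] and treats it like td:nth-child(3),
# which here is the "Achievement" column.
doc.css('table#myTable tbody td[3] a').each do |a|
  puts a.text, a[:href]
end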