I need to parse an HTML table from a URL using Nokogiri. My HTML looks like this:
<table class="tbl" cellspacing="1" cellpadding="4" id="gvResult" style="width:100%;">
<tbody>
<tr class="trh">
<th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$1')">Фирма</a></th>
<th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$2')">Артикул</a></th>
<th scope="col">Инф.</th>
<th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$3')">Описание</a></th>
<th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$6')">Нал.</a></th>
<th scope="col" style="width:55px;"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$8')">Мин. заказ, шт</a></th>
<th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$5')">Ожидаемый срок, дн. </a><a href="/help/hint/default.aspx?id=43" onclick="javascript:ShowTipLayer(this, event,this.href,30,20);return false;"><img src="http://s.exist.ru/img/q2.gif" alt="Помощь" /></a></th>
<th scope="col"><a href="javascript:__doPostBack('ctl00$ctl00$b$b$gvResult','Sort$7')">Цена</a></th>
<th scope="col"> </th>
</tr>
<tr>
<td class="tabletitle" colspan="12">Запрошенный артикул</td>
</tr>
<tr onclick="colorize(this);" id="item_0" tcolor="">
<td class="artMerge" id="item_0" rowspan="2"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=47721138-000d-40c7-99f1-02d2f0005c83">Knecht (Mahle Filter)</a></td>
<td class="artMerge" rowspan="2" style="white-space:nowrap;">O * * * D</td>
<td class="artMerge" align="center" rowspan="2" style="white-space:nowrap;"></td>
<td class="artMerge" rowspan="2" style="padding:10px 10px 0 10px;">Фильтр масляный</td>
<td align="center">99</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-38be-0297-2f1100002de2" target="_blank">0</a></td>
<td class="price" align="right">56 400 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-38be-0297-2f1100002de2&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_1" tcolor="">
<td align="center">1782</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=1&s=0100f98e-5c00-b5be-0297-0f1200002de2" target="_blank">1</a></td>
<td class="price" align="right">55 000 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-b5be-0297-0f1200002de2&sr=-4"></a></td>
</tr>
<tr>
<td class="tabletitle" colspan="12">Аналоги (заменители) для запрошенного артикула <a href="/news/newstext.aspx?id=1367" target="_blank"><img src="http://s.exist.ru/img/q2.gif" alt="Помощь" /></a></td>
</tr>
<tr onclick="colorize(this);" id="item_2" tcolor="">
<td class="firmname" id="item_2"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=e0c712d8-215d-4000-9a64-f02c7200005c">Alco</a></td>
<td style="white-space:nowrap;">M * * * 5</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">1</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-71fb-0241-720c00002c7d" target="_blank">0</a></td>
<td class="price" align="right">37 700 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-71fb-0241-720c00002c7d&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_3" tcolor="">
<td class="firmname" id="item_3"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=b113459e-001c-4000-a33e-5021bd20005c">Bosch</a></td>
<td style="white-space:nowrap;">1 * * * 9</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">8</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-3495-0002-bd11000021c2" target="_blank">0</a></td>
<td class="price" align="right">30 200 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-3495-0002-bd11000021c2&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_4" tcolor="">
<td class="firmname" id="item_4"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=40f20c71-000a-400f-bd7d-02c720005c78">Champion</a></td>
<td style="white-space:nowrap;">X * * * 6</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">1</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-713d-0142-720c00002cb0" target="_blank">0</a></td>
<td class="price" align="right">59 500 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-713d-0142-720c00002cb0&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_5" tcolor="">
<td class="firmname" id="item_5"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=50d11138-0008-4436-b40f-02d2f0005cc4">Clean filters</a></td>
<td style="white-space:nowrap;">M * * * 0</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">100</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-385f-012b-2f1100002df5" target="_blank">0</a></td>
<td class="price" align="right">32 500 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-385f-012b-2f1100002df5&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_6" tcolor="">
<td class="firmname" id="item_6"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=e0cb777d-0ecd-4000-8f30-802be890005c">Filtron</a></td>
<td style="white-space:nowrap;">O * * * 1</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">10</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-b781-0297-e80c00002be2" target="_blank">0</a></td>
<td class="price" align="right">29 000 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-b781-0297-e80c00002be2&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_7" tcolor="">
<td class="firmname" id="item_7"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=64e20c71-000c-40c6-8e51-02c720005c0f">Fram</a></td>
<td style="white-space:nowrap;">C * * * O</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">31</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-7151-00e8-720c00002c21" target="_blank">0</a></td>
<td class="price" align="right">45 500 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-7151-00e8-720c00002c21&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_8" tcolor="">
<td class="firmname" id="item_8"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;" href="hint/?s=a1135e09-11ba-4000-a216-d02b3670005c">Hengst</a></td>
<td style="white-space:nowrap;">E * * * 8</td>
<td align="center" style="white-space:nowrap;"></td>
<td>Фильтр масляный</td>
<td align="center">10</td>
<td align="center">1</td>
<td class="statis"><a class="stat" onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false" href="/stat/hint/deliv.aspx?d=0&s=0100f98e-5c00-35e4-0024-361100002b92" target="_blank">0</a></td>
<td class="price" align="right">48 900 р.</td>
<td class="basket"><a title="Купить" href="/profile/orders/basket.aspx?pid=83A07C7A&in=0100f98e-5c00-35e4-0024-361100002b92&sr=-4"></a></td>
</tr>
<tr onclick="colorize(this);" id="item_9" tcolor="">
<td class="firmname" id="item_9"><a onclick="javascript:ShowTipLayer(this,event,this.href, 30,130); return false;"
...
</tr>
</tbody>
</table>
Also note that it contains Russian characters.
My Ruby code looks like this:
html = open('http://exist.by/price.aspx?pcode=ox143d')
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'
rows = doc.xpath('//table[@id="gvResultTable"]/tbody/tr[@id="item_1"]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[8]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details
logger.warn("!!!!!!!!!!")
logger.warn(@details)
I don't know how to correctly get the data from the tr elements that have an item id.
Answer 0 (score: 3)
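Several elements share the same id attribute, which makes the HTML invalid: <tr onclick="colorize(this);" id="item_2" tcolor=""> <td class="firmname" id="item_2">.
Also, the id of the table element is "gvResult", whereas your Ruby code asks Nokogiri to look for a table with id="gvResultTable".
If the HTML can be fixed, this works fine: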
HTML:
<table id="gvResult">
<tbody>
<tr id="item_1">
<td class="firmname">Example1</td>
<td class="price">42.00</td>
</tr>
<tr id="item_2">
<td class="firmname">Example2</td>
<td class="price">24.00</td>
</tr>
</tbody>
</table>
Ruby:
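require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'pp'

# URI.open comes from open-uri; a bare Kernel#open no longer fetches URLs on Ruby 3.0+
html = URI.open('http://www.example.com/page')
doc = Nokogiri::HTML(html)
doc.encoding = 'utf-8'

# Select every row whose id starts with "item_"
rows = doc.search('//tr[starts-with(@id, "item_")]')
@details = rows.collect do |row|
  detail = {}
  [
    [:firmname, 'td[1]/text()'],
    [:price, 'td[2]/text()'],
  ].each do |name, xpath|
    # at_xpath returns the first match (or nil); to_s turns nil into ""
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end
pp @details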
I assumed you want to get the data from all tr elements with an id like "item_\d+", so I used doc.search('//tr[starts-with(@id, "item_")]'). Change it to suit your needs.