Question

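I want to parse some statistics about a handful of companies from epinions.com pages. Epinions uses almost no ids or classes, which makes the site very hard to parse.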

I need to iterate over all of the <tr bgcolor="white"> elements. I have provided two samples.

From sample 1, I need to extract:

The alt from this line:

<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">

The href from this line:

<a href="/content_218093751940" style="text-decoration:none;">CHUMBO ROCKS!</a>

The author from this line:

<span class="rgr">by <a  href="/user-whitey436" itemprop="author">whitey436</a>,&nbsp;Jan 18, 2006

Here is sample 1:

<tr bgcolor="white">
  <td style="padding:10px 5px" align="right" valign="top" height="100%">
    <table cellspacing="4" cellpadding="0" border="0" width=100% height="100%">
      <tr valign="top">
        <td class="rkr" nowrap>Overall Rating:</td>
        <td width=80>
          <img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
        </td>
      </tr>
      <span class="rgr">
        <tr>
          <td class="rgr" nowrap>Ease of Ordering:</td>
          <td>
            <img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
          </td>
        </tr>
        <tr>
          <td class="rgr" nowrap>Customer Service:</td>
          <td>
            <img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
          </td>
        </tr>
        <tr>
          <td class="rgr" nowrap>Selection:</td>
          <td>
            <img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
          </td>
        </tr>
        <tr>
          <td class="rgr" nowrap>On-Time Delivery:</td>
          <td>
            <img src="http://img.epinions.com/images/epi_images/e3/quant_5.gif" width=80 height=11>
          </td>
        </tr>
      </span>
      <tr valign="bottom" height="100%">
        <td class="rkb" colspan="2">
          <div align="center"> </div>
          <div align="center"> </div>
        </td>
      </tr>
    </table>
  </td>
  <td style="padding:10px;" colspan=2 width="100%" align="left" valign="top">
    <h2 style="font-family:arial,helvetica,sans-serif; font-size:87%; color:#000000; font-weight:bold; margin-bottom:0px;">
      <a href="/content_218093751940" style="text-decoration:none;">CHUMBO ROCKS!</a>
    </h2>
    <span style="line-height:110%">
      <span class="rgr">by <a  href="/user-whitey436" itemprop="author">whitey436</a>,&nbsp;Jan 18, 2006
      Rated a <span style="color:#000;">Very Helpful Review</span> by the Epinions community</span>
    </span>
    <span class="rkr">
      <div style="padding:5px 0px"> Its just this simple, I tried buying this receiver from another online supplier who had the lowest price only to find they didnt have any of these units and they wanted to sell me extra warranty then tried to sell a different model in stock from Yamaha  ...</div>
      <b>
        <a  href="/content_218093751940">Read the full review</a>
      </b>
    </span>
  </td>
</tr>

From sample 2, I need to extract:

The alt from this line:

<img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">

The href from this line:

<a  href="/content_224519491204">Read more</a>

The author from this line:

<span class="rgr">by <a  href="/user-whitey436" itemprop="author">whitey436</a>,&nbsp;Jan 18, 2006
Rated a <span style="color:#000;">Very Helpful Review</span> by the Epinions community</span>

Here is sample 2:

<tr bgcolor="white">
  <td style="padding:10px 5px" align="right" valign="top">
    <table cellspacing="4" cellpadding="0" border="0" width=100%>
      <tr>
        <td class="rkr" nowrap>Overall Rating:</td>
        <td width=80>
          <img src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif" alt="Store Rating: 5.0" width="79" height="13" border="0">
        </td>
      </tr>
      <tr>
        <td class='rgr' >&nbsp;</td>
        <td>
          <img src='http://img.epinions.com/images/epi_images/spacer.gif' width=80 height=11>
        </td>
      </tr>
    </table>
  </td>
  <td style="padding:10px;" colspan=2 width="100%" align="left" valign="top">
    <span class="rgr">Mar 27, 2006 <br>(Not Yet Rated)</span><br>
    <span class="rkr"> Very helpful in giving me the information I needed to make a purchase.<br><b>
      <a  href="/content_224519491204">Read more</a>
    </b></span>
  </td>
</tr>

Answer 1

Here is some Nokogiri code that uses XPath to print out the information you want:

xml.xpath("//tr[@bgcolor='white']").each do |el|
  # Get the "Overall rating" tr block from the first td and get (first) img alt
  puts el.at_xpath("td[1]//tr[td/text()='Overall Rating:']//img/@alt")
  # Get the first link from the second td that contains "content" and get href
  puts el.at_xpath("td[2]//a[contains(@href, '/content')][1]/@href")
  # Get the (first) link that has an itemprop author value and get the href
  puts el.at_xpath("td[2]//a[@itemprop='author']/@href")
end

Answer 2

Nokogiri would work well for this.

To get the alt, select the img tags that have the given src:

imgs = doc.css('img[src="http://img.epinions.com/images/epi_images/ratings/checks_sm_5.0.gif"]')

To get the href:

links = doc.css('a[href*="/content"]')

To get the author:

authors = doc.css('a[href*="/user"]')

Parsing a table on a web page without ids or classes using Nokogiri or XPath
