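Need help finding the text of elements with a class?

Time: 2013-07-17 13:57:53

Tags: ruby screen-scraping nokogiri

I have a document that I query with page.css("table.vc_result span a"), and I cannot get at the second and third span elements in it:

Document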

<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
  <tr>
    <td width="260" valign="top">
      <table>
        <tbody>
          <tr>
            <td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
            Gateway Megatech</a></span><br>
            <span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
          </tr>

          <tr>
            <td><span class="cAddText">Cook County Illinois</span></td>
          </tr>

          <tr>
            <td><span class="cAddCategory">Yellow Page Advertising And Telephone
            Directory Publica Chicago</span></td>
          </tr>
        </tbody>
      </table>
    </td>

    <td width="260">
      <table align="center">
        <tbody>
          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td>
                      <div style=
                      "background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
                      </div>
                    </td>

                    <td><font style="font-weight:bold">847-506-7800</font></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>

          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td>
                      <div style=
                      "background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
                      </div>
                    </td>

                    <td><a href=
                    "/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
                    class="cAddNearby">Businesses near 60696</a></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>

          <tr>
            <td>
              <table>
                <tbody>
                  <tr>
                    <td></td>
                  </tr>
                </tbody>
              </table>
            </td>
          </tr>
        </tbody>
      </table>
    </td>
  </tr>
</tbody>
</table>

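...This is not the complete document; it contains more span entries.

The code I am using finds the exact text, but I cannot relate it to the text of the spans that follow the nested span > a element.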

require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"

burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
page = Nokogiri::HTML(URI.open(url))

rows = page.css("table.vc_result span a")
rows.each do |arow|

  if arow.text == "Gateway Megatech"
    puts(arow.next_element.text)
    puts("Capturing the next span text")
    found="Got it"
    break       
  else
    puts("Found nothing")
    found="None"
  end
end

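2 answers:

Answer 0 (score: 2)

Assuming each business is a new <tr> in the top-level table you provided, the following code gives you an array of hashes with the values: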

require 'nokogiri'

# `html` is assumed to hold the markup shown in the question.
doc = Nokogiri.HTML(html)

business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
  # Inside the first <td> of the row, find the nested <td> that contains a.cAddName.
  # The leading dot keeps the predicate relative to each candidate <td>;
  # a bare //a would search the whole document instead.
  business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')
  name     = business.at_css('a.cAddName').text.strip
  address  = business.at_css('.cAddText').text.strip

  # Inside the second <td> of the row, find the first <font> tag
  phone    = tr.at_xpath('td[2]//font').text.strip

  # Return a hash of values for this row, using the capitalization requested
  { Name:name, Address:address, Phone:phone }
end

p details
#=> [
#=>   {
#=>     :Name=>"Gateway Megatech",
#=>     :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=>     :Phone=>"847-506-7800"
#=>   }
#=> ]

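This is remarkably fragile, but it works for what you have provided, and there do not seem to be many semantic hooks to hang onto in this insane horror-show abuse of HTML.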

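Answer 1 (score: 0)

Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM/XML into a tree structure.

http://nokogiri.org/ is very popular.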