Question

I have the following HTML:

<div class="info">
 <h5>
   <a href="/aaa/">aaa </a>
 </h5>
 <span class="date">
       8:27AM, Sep 30</span>     
</div>

I am using Ruby, and I want to get the text "8:27AM, Sep 30" inside <span class="date">. I cannot find it with the following command:

find('div.info span.date').text

Can you tell me why it doesn't work? If I use the following command to find the text inside the h5, I correctly get "aaa":

find('div.info h5').text

Full Ruby code

Then(/^you should see (\d+) latest items$/) do |arg1|
    within("div.top-feature-list") do
       # Validate images of those items exist, print report
       expect(all("img").size.to_s).to eq(arg1)
       puts "The number of items on the current site is " + (all("img").size.to_s)
       # List of all items' details (Image, Headline, Introduction, Identifier, Url)
       $i = 1
       while $i <= arg1.to_i do
          puts "Item no." + $i.to_s
          puts "        - Image:        " + find('ul.category-index li.item-' + $i.to_s + ' img')[:src].to_s
          puts "        - Headline: " + find('ul.category-index li.item-' + $i.to_s + ' div.info h5').text
          puts "        - Introduction: " + find('ul.category-index li.item-' + $i.to_s + ' div.summary').text
          puts "        - Url:      " + find('ul.category-index li.item-' + $i.to_s + ' div.info h5 a')[:href].to_s
          puts "        - Created Date " + find('ul.category-index li.item-' + $i.to_s + ' div.info span.date').text
          puts "        - Identifier:   " + find('ul.category-index li.item-' + $i.to_s + ' div.img a.section-name').text
          puts "        - Subsection:   " + find('ul.category-index li.item-' + $i.to_s + ' div.img a.section-name')[:href].to_s
          $i += 1
       end
    end
end

More HTML

<div class="top-feature-list">  
 <ul class="category-index">
    <li class="group">
           <ul>
    <li class="item-1 left ">
        <a name="item-1"></a>
        <div class="img">
            <a href="/health-lifestyle/item1.html">
                <img alt="How to" src="//image_url">     
            </a>

            <a class="section-name test" href="/health-lifestyle/">
                LIFESTYLE </a>
        </div>
        <div class="info">
            <h5>

                <a href="/health-lifestyle/item1.html">
                    How to </a>

            </h5>
            <span class="date">
                10:20AM, Sep 30</span>

        </div>
        <div class="summary">

            <p>
                Summary text</p>

        </div>


    </li>
    ....

env.rb

require 'parallel_tests'
require 'capybara/cucumber'
require 'capybara/poltergeist'
require 'rspec'

Answer 1

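Parsing HTML in Ruby is very easy. All you need is to require two libraries in your program (open-uri from the standard library and the nokogiri gem):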

require 'open-uri'
require 'nokogiri'

# Set the page you are going to scan.
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
page = Nokogiri::HTML(URI.open("http://google.com/"))

# (Updated to reflect the date class provided in question)
# Extract specific elements via CSS selector.
# This first selects every span element,
# then narrows down to those with the class "date".
# Use .strip (not .strip!, which returns nil when there is nothing to remove)
# to trim the whitespace that surrounds the text in the HTML.

page.css('span').css('.date').text.strip

# => "8:27AM, Sep 30" for the HTML in the question
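
The whitespace point above can be checked without any HTML at all. The following is a minimal sketch using only core Ruby; the `raw` string is an assumption that mimics the text a parser would return for the span in the question (a newline and indentation before the date), and the variable names are illustrative only.

```ruby
# Assumed raw text of <span class="date">, including the leading
# newline and indentation from the markup in the question.
raw = "\n       8:27AM, Sep 30"

# String#strip always returns a trimmed copy.
clean = raw.strip

# String#strip! trims in place and returns nil when nothing was removed,
# so chaining it at the end of an expression can yield nil unexpectedly.
already_clean = "8:27AM, Sep 30".dup
bang_result = already_clean.strip!

puts clean
p bang_result
```

This is why the non-bang `strip` is the safer choice at the end of a method chain.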

If you want to learn more about parsing HTML with Ruby, search for the topic and read up on it. A good resource to help you get started is here.

Answer 2

Use find('.info > .date').text to get the content from the page.

    irb(main):035:0> find('.info').text
    => "aaa 8:27AM, Sep 30"
    irb(main):036:0> find('.info > .date').text
    => "8:27AM, Sep 30"

Cannot find `<span>`
