I have the following HTML:
<div class="info">
<h5>
<a href="/aaa/">aaa </a>
</h5>
<span class="date">
8:27AM, Sep 30</span>
</div>
I am using Ruby, and I want to get the text "8:27AM, Sep 30" inside <span class="date">. I cannot find it with the following command:
find('div.info span.date').text
Can you tell me why it doesn't work? If I use the following command to find the text inside h5, I correctly get "aaa":
find('div.info h5').text
Full Ruby code
Then(/^you should see (\d+) latest items$/) do |arg1|
  within("div.top-feature-list") do
    # Validate images of those items exist, print report
    expect(all("img").size.to_s).to eq(arg1)
    puts "The number of items on the current site is " + (all("img").size.to_s)
    # List of all items' details (Image, Headline, Introduction, Identifier, Url)
    $i = 1
    while $i <= arg1.to_i do
      puts "Item no." + $i.to_s
      puts " - Image: " + find('ul.category-index li.item-' + $i.to_s + ' img')[:src].to_s
      puts " - Headline: " + find('ul.category-index li.item-' + $i.to_s + ' div.info h5').text
      puts " - Introduction: " + find('ul.category-index li.item-' + $i.to_s + ' div.summary').text
      puts " - Url: " + find('ul.category-index li.item-' + $i.to_s + ' div.info h5 a')[:href].to_s
      puts " - Created Date " + find('ul.category-index li.item-' + $i.to_s + ' div.info span.date').text
      puts " - Identifier: " + find('ul.category-index li.item-' + $i.to_s + ' div.img a.section-name').text
      puts " - Subsection: " + find('ul.category-index li.item-' + $i.to_s + ' div.img a.section-name')[:href].to_s
      $i += 1
    end
  end
end
More HTML
<div class="top-feature-list">
<ul class="category-index">
<li class="group">
<ul>
<li class="item-1 left ">
<a name="item-1"></a>
<div class="img">
<a href="/health-lifestyle/item1.html">
<img alt="How to" src="//image_url">
</a>
<a class="section-name test" href="/health-lifestyle/">
LIFESTYLE </a>
</div>
<div class="info">
<h5>
<a href="/health-lifestyle/item1.html">
How to </a>
</h5>
<span class="date">
10:20AM, Sep 30</span>
</div>
<div class="summary">
<p>
Summary text</p>
</div>
</li>
....
env.rb
require 'parallel_tests'
require 'capybara/cucumber'
require 'capybara/poltergeist'
require 'rspec'
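Answer 0 (score: 0)
Parsing HTML in Ruby is very easy. All you need is to require two libraries in your program: open-uri (part of the standard library) and the nokogiri gem: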
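require 'open-uri'
require 'nokogiri'
# Set the page you are going to scan (replace the URL with your own page).
# URI.open is used because Kernel#open no longer accepts URLs as of Ruby 3.0.
page = Nokogiri::HTML(URI.open("http://google.com/"))
# (Updated to reflect the date class provided in question)
# Extract specific elements via CSS selector.
# This first selects everything that has a span tag,
# then narrows down to anything with the class "date".
# Use .strip (not .strip!, which returns nil when there is nothing to trim)
# to remove the whitespace left over from the HTML.
page.css('span').css('.date').text.strip
# => "8:27AM, Sep 30" when run against the page from the question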
If you want to learn more about parsing HTML with Ruby, search online and read up on it. A good resource to help you get started is here.
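One detail worth illustrating when trimming the extracted text: String#strip always returns a string, while String#strip! modifies the receiver and returns nil when there was nothing to remove. The following is a minimal sketch in plain Ruby. The constant RAW_DATE and the helper clean_text are hypothetical names introduced for illustration, and the raw string simply mimics the leading newline inside the span from the question.

```ruby
# Text as it appears inside <span class="date"> in the question,
# including the leading newline that comes from the markup.
RAW_DATE = "\n8:27AM, Sep 30"

# Hypothetical helper: returns a trimmed copy and never returns nil.
def clean_text(raw)
  raw.strip
end

puts clean_text(RAW_DATE).inspect   # "8:27AM, Sep 30"

# strip! works in place and returns nil if no whitespace was removed,
# so chaining further calls onto it can fail on already-clean text.
already_clean = "8:27AM, Sep 30".dup
puts already_clean.strip!.inspect   # nil
```

For that reason the non-bang strip is usually the safer choice at the end of a call chain.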
Answer 1 (score: 0)
Use find('.info > .date').text to get the content from the page:
irb(main):035:0> find('.info').text
=> "aaa 8:27AM, Sep 30"
irb(main):036:0> find('.info > .date').text
=> "8:27AM, Sep 30"
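To check the parent/child relationship that '.info > .date' relies on without starting a browser session, here is a rough sketch that swaps in REXML and XPath instead of Capybara. It assumes REXML is available (it ships with standard Ruby installs) and that the fragment is well-formed, which holds for the snippet from the question. The constant HTML_FRAGMENT and the helper text_at are hypothetical names, not part of Capybara.

```ruby
require 'rexml/document'

# The fragment from the question; it is well-formed, so an XML parser can read it.
HTML_FRAGMENT = <<~HTML
  <div class="info">
  <h5>
  <a href="/aaa/">aaa </a>
  </h5>
  <span class="date">
  8:27AM, Sep 30</span>
  </div>
HTML

# Hypothetical helper: trimmed text of the first node matching the XPath,
# or nil when nothing matches.
def text_at(xml, xpath)
  doc = REXML::Document.new(xml)
  node = REXML::XPath.first(doc, xpath)
  node && node.text.to_s.strip
end

# Rough XPath counterpart of the CSS selector '.info > .date'.
# It compares the whole class attribute, so unlike CSS it would
# not match an element such as class="info extra".
puts text_at(HTML_FRAGMENT, "//*[@class='info']/*[@class='date']")
puts text_at(HTML_FRAGMENT, "//*[@class='info']/h5/a")
```

This only confirms that the span is a direct child of the .info element in the posted markup; it says nothing about how the live page is rendered in the browser driver.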