<form method="post" action="/M740/Biography/History/Drama/12+Years+a+Slave">
<input type="image" src="/public_site/webroot/cache/imdb/2024544_100.jpg" width="100" style="float:right;margin-left:2px;">
<strong><span style="color: rgb(255, 69, 0);">12 Years a Slave</span></strong>
<br>
In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.<br>
<br><strong>Century Cinemax - Junction</strong><br>
<a href="tel:0774136246">0774136246</a>
<a href="tel:0208022073">0208022073</a>
<br>
12:10, 19:10, 21:40<br>
<br><strong>Fox Cineplex Sarit</strong><br>
<a href="tel:0203753025">0203753025</a>
<a href="tel:0720366208">0720366208</a>
<br>
11:00, 14:00, 18:00, 20:40<br>
<br><strong>Planet Media - Kisumu </strong><br>
<a href="tel:0731999100">0731999100</a>
<a href="tel:0724999100 & 0202629388">0724999100 & 0202629388</a>
<br>
12:00, 14:30, 20:30<br>
<br>
<input type="hidden" name="cinema" value="0">
<input type="hidden" name="searchMovie" value="0">
<input type="hidden" name="movie" value="740">
<input type="hidden" name="date" value="0">
<input type="hidden" name="groupId" value="0">
<input type="submit" name="ok" value="Further Details">
</form>
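OK, this is just part of the HTML I'm trying to parse with Nokogiri. The HTML isn't very semantic, and I'm having a hard time getting what I need out of it. For reference, this is the site I'm trying to scrape (http://flix.co.ke/Frontpage/Listings).
So far I can get the movie title, one cinema, and two phone numbers, but with my approach I can't get everything I need.
This is the script I'm currently using: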
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://flix.co.ke/Frontpage/Listings"
# URI.open is required on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open(url))
doc.css(".min-width div form").each do |entry|
  title = entry.at_css("span").text
  puts title
  # at_css returns only the first match, so only one cinema is printed
  cinema = entry.at_css("br+ strong").text
  puts cinema
  phone = entry.at_css("a").text
  puts phone
  puts entry.at_css("a").next_element.text
end
With this I can only get the title of the movie, one cinema, and two contact numbers, so my sample output looks like this:
12 Years a Slave
Century Cinemax - Junction
0774136246
0208022073
47 Ronin 3D
Century Cinemax - Junction
0774136246
0208022073
Delivery Man
Century Cinemax - Junction
0774136246
0208022073
Frozen
Century Cinemax - Junction
0774136246
0208022073
(continued...)
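There is a description after the break tag that follows the title, and I can't get it. How do I loop through all the cinemas inside the form tag, and get the phone numbers and the individual, comma-separated show times?
I just don't know where to start. In this case I'm hoping for a result like this:
12 Years a Slave
In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.
etc.
Any help will be highly appreciated. Thanks in advance.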
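Answer 0 (score: 1)
This is horrible HTML :/ It's invalid, with 451 errors and 9 warnings. It has no semantics, so you have to rely on structure, which may change and break your scraper.
Nonetheless, you can get each description using the sibling methods: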
doc.css('.min-width div form').each do |node|
  description = node.at_css('br').next_sibling.text
  puts description.strip
  puts '-'*10
end
# >> In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.
# >> ----------
# >> A band of samurai set out to avenge the death and dishonor of their master at the hands of a ruthless shogun.
# >> ----------
# >> An affable underachiever finds out he's fathered 533 children through anonymous donations to a fertility clinic 20 years ago. Now he must decide whether or not to come forward when 142 of them file a lawsuit to reveal his identity.
# >> ----------
# >> Fearless optimist Anna teams up with Kristoff in an epic journey, encountering Everest-like conditions, and a hilarious snowman named Olaf in a race to find Anna's sister Elsa, whose icy powers have trapped the kingdom in eternal winter.
# >> ----------
# >> A medical engineer and an astronaut work together to survive after an accident leaves them adrift in space.
# >> ----------
# >> A pair of aging boxing rivals are coaxed out of retirement to fight one final bout -- 30 years after their last match.
# >> ----------
# >>
# >> ----------
# >> Harrison, overworked and underpaid is looking for money for bride price. A 'business' opportunity presents itself when he gets the keys to the Company house. With the CEO away on holiday, he has access to a vacant fully furnished house. He ...
# >> ----------
# >>
# >> ----------
# >> A chronicle of Nelson Mandela's life journey from his childhood in a rural village through to his inauguration as the first democratically elected president of South Africa.
# >> ----------
# >> Author P. L. Travers reflects on her difficult childhood while meeting with filmmaker Walt Disney during production for the adaptation of her novel, Mary Poppins.
# >> ----------
# >> The Manzoni family, a notorious mafia clan, is relocated to Normandy, France under the witness protection program, where fitting in soon becomes challenging as their old habits die hard.
# >> ----------
# >> The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring.
# >> ----------
# >> The film begins as Katniss Everdeen has returned home safe after winning the 74th Annual Hunger Games along with fellow tribute Peeta Mellark. Winning means that they must turn around and leave their family and close friends, embarking on a ...
# >> ----------
# >> A day-dreamer escapes his anonymous life by disappearing into a world of fantasies filled with heroism, romance and action. When his job along with that of his co-worker are threatened, he takes action in the real world embarking on a global ...
# >> ----------
# >> Faced with an enemy that even Odin and Asgard cannot withstand, Thor must embark on his most perilous and personal journey yet, one that will reunite him with Jane Foster and force him to sacrifice everything to save us all.
# >> ----------
# >> A journey into the lives of a mother polar bear and her two seven-month-old cubs as they navigate the changing Arctic wilderness they call home.
# >> ----------
# >> See and feel what it was like when dinosaurs ruled the Earth, in a story where an underdog dino triumphs to become a hero for the ages.
# >> ----------
Use css instead of at_css to loop over the cinemas (for example, the same way you loop over the form elements).
Answer 1 (score: 0)
The HTML really isn't that bad, and you were on the right track with br + strong; that is what you want to iterate over:
doc.search('.min-width div form').each do |form|
  title = form.at('span').text
  description = form.at('br').next.text
  form.search('br + strong').each do |el|
    cinema = el.text
    phones = []
    while next_el = el.at('+ a', '+ br + a')
      el = next_el
      phones << el.text
    end
    times = el.at('+ br').next.text
  end
end