I would like to be able to scrape data from a tracklist page on 1001tracklists. An example URL is:
http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html
Here is an example of how the data is displayed on the page:
Above & Beyond - Black Room Boy (Above & Beyond Club Mix) [ANJUNABEATS]
I would like to pull all of the songs from this page in the following format:
$byArtist - $name [$publisher]
After looking at the HTML of this page, I see that the content is stored in HTML5 meta microdata format:
<td class="" id="tlptr_433662">
<a name="tlp_433662"></a>
<div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording" id="tlp5_content">
<meta itemprop="byArtist" content="Above & Beyond">
<meta itemprop="name" content="Black Room Boy (Above & Beyond Club Mix)">
<meta itemprop="publisher" content="ANJUNABEATS">
<meta itemprop="url" content="/track/103905_above-beyond-black-room-boy-above-beyond-club-mix/index.html">
<span class="tracklistTrack floatL"id="tr_103905" ><a href="/track/103905_above-beyond-black-room-boy-above-beyond-club-mix/index.html" class="">Above & Beyond - Black Room Boy (Above & Beyond Club Mix)</a> </span><span class="floatL">[<a href="/label/1037_anjunabeats/index.html" title="Anjunabeats">ANJUNABEATS</a>]</span>
<div id="tlp5_actions" class="floatL" style="margin-top:1px;">
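There is an anchor with the value "tlp_433662". Each song on the page has its own unique ID: one will have "tlp_433662", the next will have "tlp_433628" or something similar.
Is there a way to extract all of the songs listed on the tracklist page using Nokogiri and XPath? I will probably want to do an "each" "do" on the "data" below so that the scraper loops through and extracts each group of related data. Here is the start of my Ruby program: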
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html"
# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs.
data = Nokogiri::HTML(URI.open(url))
# What do I do next? Print out the XPath loop code which extracts my data.
# code block I need help with
data.xpath.........each do |block|
block.xpath("...........").each do |span|
puts stuff printing out what I want.
end
end
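The end goal, which I know how to do, is to take this Ruby script into Sinatra to "webify" the data and add some nice Twitter Bootstrap CSS, as shown in this YouTube video: http://www.youtube.com/watch?v=PWI1PIvy4A8
Can you help me with the XPath code block so that I can scrape the data and print an array?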
Answer 0 (score: 2)
require 'nokogiri'
require 'rest-client'
url = 'http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'
page = Nokogiri::HTML(RestClient.get(url, :user_agent => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), nil, 'UTF-8')
# Each track is a table row with class "tlpItem"; its details are stored in <meta> tags.
page.css('table.detail tr.tlpItem').each do |row|
  # Plain CSS attribute selectors (no XPath-style "@") pick out each property.
  artist = row.css('meta[itemprop="byArtist"]').attr('content')
  name = row.css('meta[itemprop="name"]').attr('content')
  puts "#{artist} - #{name}"
end
...and a more advanced version that grabs all of the meta information in each row and prints 'Artist - Song [Publisher]':
require 'nokogiri'
require 'rest-client'
url = 'http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'
page = Nokogiri::HTML(RestClient.get(url, :user_agent => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), nil, 'UTF-8')
page.css('table.detail tr.tlpItem').each do |row|
  # Collect every <meta itemprop="..." content="..."> in the row into a hash.
  meta = row.search('meta').each_with_object({}) do |tag, hash|
    hash[tag['itemprop']] = tag['content']
  end
  puts "#{meta['byArtist']} - #{meta['name']} [#{meta['publisher'] || 'Unknown'}]"
end
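You get the picture for the rest of the attributes. You will need to do some error/existence checking because some songs don't have all of the attributes, but this should get you on the right track. I also used the rest-client gem, so feel free to use whatever you like to retrieve the page.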
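As a rough sketch of the existence checking mentioned above, the formatting step could be pulled into a small helper that tolerates missing attributes. The helper name `format_track` and the sample hashes are hypothetical; they only mimic the shape of the `meta` hash built in the loop above and are not taken from the site.

```ruby
# Hypothetical helper: turn one row's meta hash into "Artist - Name [Publisher]".
# Returns nil when artist or name is missing so the caller can skip that row.
def format_track(meta, default_publisher = 'Unknown')
  artist = meta['byArtist'].to_s.strip
  name   = meta['name'].to_s.strip
  return nil if artist.empty? || name.empty?

  publisher = meta['publisher'].to_s.strip
  publisher = default_publisher if publisher.empty?
  "#{artist} - #{name} [#{publisher}]"
end

# Sample hashes shaped like the `meta` hash from the scraping loop (made-up test data).
rows = [
  { 'byArtist' => 'Above & Beyond', 'name' => 'Black Room Boy (Above & Beyond Club Mix)', 'publisher' => 'ANJUNABEATS' },
  { 'byArtist' => 'Markus Schulz', 'name' => 'The Spiritual Gateway (Transmission 2013 Theme)' }, # no publisher
  { 'publisher' => 'LANGE RECORDINGS' } # no artist or name, so it is skipped
]

lines = rows.map { |meta| format_track(meta) }.compact
puts lines
```

Returning nil for unusable rows keeps the decision about skipping versus reporting them with the caller; whether a row without a name should really be dropped depends on what the page contains.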
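Answer 1 (score: 2)
Here is some code that gathers the information into an array of hashes.
I prefer CSS accessors over XPath because they are more readable if you have any HTML/CSS or jQuery experience.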
require 'nokogiri'
require 'open-uri'
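# URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs.
doc = Nokogiri::HTML(URI.open('http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'))
data = doc.search('tr.tlpItem div[itemtype="http://schema.org/MusicRecording"]').each_with_object([]) do |div, array|
  # Copy every <meta itemprop="..." content="..."> pair into the hash.
  hash = div.search('meta').each_with_object({}) do |m, h|
    h[m['itemprop']] = m['content']
  end
  # First link inside a <span>: the track's href and display text.
  link = div.at('span a')
  hash['tracklistTrack'] = [ link['href'], link.text ]
  # First link inside a span.floatL (on this markup, the same link as above).
  title = div.at('span.floatL a')
  hash['title'] = [ title['href'], title.text ]
  array << hash
end
pp data[0, 2]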
This outputs a subset of the page's data. After a little massaging, the structure looks like this:
[
  {
    "byArtist"=>"Markus Schulz",
    "name"=>"The Spiritual Gateway (Transmission 2013 Theme)",
    "publisher"=>"COLDHARBOUR RECORDINGS",
    "url"=>"/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
    "tracklistTrack"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ],
    "title"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ]
  },
  {
    "byArtist"=>"Lange & Audrey Gallagher",
    "name"=>"Our Way Home (Noah Neiman Remix)",
    "publisher"=>"LANGE RECORDINGS",
    "url"=>"/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
    "tracklistTrack"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ],
    "title"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ]
  }
]
Answer 2 (score: 0)
This free web service scrapes all 400+ schema.org classes from a given URL and returns them as JSON.