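So I have been trying to scrape data from a page and display it (in roughly the same format as the source). I found YQL, and I think it is great, except that I cannot figure out how to display the whole output with nothing special applied (other than basic formatting).
The YQL query is: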
select * from html where url="http://directory.vancouver.wsu.edu/anthropology" and xpath="//div[@id='facdir']"
This URL returns the result as JSON:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2Fanthropology%22%20and%20xpath%3D%22%2F%2Fdiv%5B%40id%3D'facdir'%5D%22&format=json&callback=anthropology
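I have followed Yahoo's tutorials and built the news widget, but none of the tutorials covers a basic view (no links needed either, just a paragraph layout).
Like this: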
Name
Title
Phone:(###)###-####
Location: Building and Room #
email@vancouver.wsu.edu
Here is what I put together from http://christianheilmann.com, but it does not do anything (apparently none of the tutorials there work; I tried every one):
<html>
<head>
<script src="http://code.jquery.com/jquery-latest.js"></script>
</head>
<body>
<p>
<b>Copied:</b>
</p>
<div>
<script>
function anthropology (0) {
// get the DIV with the ID $
var info = document.getElementById('facdir');
// add a class for styling
info.className = 'js';
// if it exists
if(info){
// get the info data returned from YQL
var data = o.query.results.span;
var link = info.getElementsByTagName('a')[0];
link.innerHTML = '(see all info)';
// to the main container DIV
var out = document.createElement('span');
out.className = 'info';
info.insertBefore(out,link.parentNode);
}
}
</script>
<script src='http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2Fanthropology%22%20and%20xpath%3D%22%2F%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%22&format=json&callback=anthropology'></script>
</div>
Answer 0 (score: 4)
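I recently completed a tutorial with several jsFiddles that explains how to use YQL, XPath, and jQuery .ajax() to solve a different SO question; it should shed some light in your direction. You can see the SO Answer here.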
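For reference, the long request URLs in this thread are just the readable YQL statement run through URL encoding. Here is a minimal sketch of that step. `buildYqlUrl` and its parameters are illustrative names I made up, not part of YQL or jQuery; the endpoint and query shape are taken from the question.

```javascript
// Sketch: build a YQL request URL from a readable query.
// buildYqlUrl and its parameter names are illustrative, not a library API.
function buildYqlUrl(pageUrl, xpath, format) {
  // The YQL statement exactly as it would be typed by hand.
  var query = 'select * from html where url="' + pageUrl +
              '" and xpath="' + xpath + '"';
  // encodeURIComponent escapes spaces, quotes, slashes and brackets,
  // but leaves * and ' untouched, which matches the URLs in the question.
  return 'http://query.yahooapis.com/v1/public/yql' +
    '?q=' + encodeURIComponent(query) +
    '&format=' + encodeURIComponent(format);
}

var url = buildYqlUrl(
  'http://directory.vancouver.wsu.edu/anthropology',
  "//div[@id='facdir']",
  'json'
);
console.log(url);
```

For the JSONP style used in the question, you would append `&callback=anthropology` to the returned string.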
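To give your question an acceptable answer, I put together a working demo that shows how easy it is to scrape data from the webpage you requested.
The jsFiddle demo contains extensive comments and console.log() messages so you can follow the workflow. Make sure your browser console is open; Firebug is used in the example. The HTML and CSS used to build the Faculty Member Boxes mimic the original website, including the image, name, email, and links, in keeping with the page's theme.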
Sample:
jsFiddle Data Scraping XML: Dynamic Webpage Building
EDIT: In addition to the revised jsFiddle above, see the related
jsFiddle Tutorial: Creating Dynamic Div's (Now Improved!)
HTML:
<div id="results"></div>
jQuery:
var directoryName = 'child-development-program';
$.ajax({
    type: 'GET',
    url: "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2F" + directoryName + "%22%20and%20xpath%3D%22%2F%2Fdiv%5B%40id%3D'content-inner'%5D%2Fdiv%2Fdiv%2Fdiv%2Fdiv%2Fdiv%5B2%5D%22",
    dataType: 'xml',
    success: function(data) {
        if (data) {
            // Show the returned query (a jQuery object) in the console.
            console.info('Here is the returned query');
            console.log( $(data).find('query') );
            // Show the results as inner text in the console.
            var textResults = $(data).find('results').text();
            console.log( textResults );
            // Walk the list of faculty members. The index argument indexFM is available but unused.
            $(data).find('results').find('.views-row').each(function(indexFM){
                // This variable stores the current faculty member.
                var facultyMember = this;
                console.info('Faculty jQuery DIV Object shown on next lines.');
                console.log( facultyMember );
                // Parse the contents of each faculty member. The index argument indexFC is available but unused.
                $(facultyMember).each(function(indexFC){
                    // Get Thumbnail Image of Faculty Member
                    var facultyMemberImage = $(this).find('.views-field-field-profile-image-fid #directoryimage a img').attr('src');
                    console.log( facultyMemberImage );
                    // Get Title (Name) of Faculty Member
                    var facultyMemberTitle = $(this).find('.views-field-field-professional-title-value #largetitle').text();
                    console.log( facultyMemberTitle );
                    // Get relative URL fragment.
                    //
                    // Stackoverflow Edit: Much more extraction in this section, see jsFiddle link above.
                    // (facultyMemberUrl, facultyMemberPosition, facultyMemberPhone, facultyMemberBuilding,
                    // facultyMemberRoom and facultyMemberEmailUrl, used below, are presumably set in that elided part.)
                    //
                    // Get Email of Faculty Member
                    var facultyMemberEmail = $(this).find('.views-field-field-email-value span').text();
                    // Simple dashed line to separate faculty members as seen in browser console.
                    console.log('--------');
                    var divObject = '<div class="dynamicResults">' +
                        '<div class="dynamicThumb"><a href="' + facultyMemberUrl + '"><img src="' + facultyMemberImage + '" alt=""></a></div>' +
                        '<div class="dynamicInfo">' +
                        '<div class="dynamicText"><a href="' + facultyMemberUrl + '" class="dynamicName">' + facultyMemberTitle + '</a></div>' +
                        '<div class="dynamicText">' + facultyMemberPosition + '</div>' +
                        '<div class="dynamicText">Phone: ' + facultyMemberPhone + '</div>' +
                        '<div class="dynamicText">Location: ' + facultyMemberBuilding + ' <span>' + facultyMemberRoom + '</span></div>' +
                        '<div class="dynamicText"><a href="' + facultyMemberEmailUrl + '" class="dynamicEmail">' + facultyMemberEmail + '</a><span class="dynamicEmailpic"></span></div>' +
                        '</div></div><div class="clear"></div>';
                    // Build webpage with dynamic data.
                    $('#results').append( divObject );
                });
            });
        }
    }
});
Screenshot: the thumbnails in the photo are 100px x 100px. Revised photo for the revised jsFiddle!
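If all you want is the plain layout from the question (name, title, phone, location, email, no links), the extracted values can be joined into a single paragraph instead of the styled boxes. Here is a minimal sketch, assuming each faculty member has already been collected into an object. The field names and sample data are hypothetical, and `escapeHtml` and `renderFacultyParagraph` are illustrative helpers, not part of jQuery or the fiddle.

```javascript
// Sketch: render one faculty record as a plain paragraph.
// Field names (name, title, phone, building, room, email) are hypothetical.

// Escape text so scraped values cannot inject markup into the page.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Join the fields in the order shown in the question, one per line.
function renderFacultyParagraph(member) {
  var lines = [
    member.name,
    member.title,
    'Phone: ' + member.phone,
    'Location: ' + member.building + ' ' + member.room,
    member.email
  ];
  return '<p>' + lines.map(escapeHtml).join('<br>') + '</p>';
}

// Placeholder data for illustration only, not taken from the directory.
var sample = {
  name: 'Jane Doe',
  title: 'Professor',
  phone: '(555) 555-0100',
  building: 'Example Hall',
  room: '101',
  email: 'jane.doe@example.edu'
};
console.log(renderFacultyParagraph(sample));
```

Inside the `.each()` loop you could then call `$('#results').append(renderFacultyParagraph(member))` once the values have been gathered into `member`.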
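But in really focusing on your question, I wanted to try something new and simple, and the results turned out quite acceptable. This time, the data-scraping technique uses the webpage's own CSS file as an asset in the jsFiddle and inserts the returned data directly into the DOM.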
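This method uses the same principles as above, except that it uses html as the .ajax() dataType to produce a near clone of the original webpage. The only drawback is that the whole CSS file is required, but you can parse the original file to strip out the extra styles and selectors you do not need (it is important not to break the 4096 CSS selector barrier in IE).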
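To get a feel for how close a stylesheet is to that selector barrier, you can count its selectors roughly before trimming. This is a naive sketch, assuming you already have the stylesheet text in a string. `countSelectors` is an illustrative helper, it skips at-rule preludes such as `@media`, and it will miscount unusual cases (for example keyframe steps or braces inside strings).

```javascript
// Sketch: rough count of selectors in a stylesheet's text.
// Naive by design; countSelectors is an illustrative helper, not a CSS parser.
function countSelectors(cssText) {
  // Drop /* ... */ comments so commas inside them are not counted.
  var withoutComments = cssText.replace(/\/\*[\s\S]*?\*\//g, '');
  var count = 0;
  // Each match captures the text before a "{": a selector list or an at-rule prelude.
  var rulePattern = /([^{}]+)\{/g;
  var match;
  while ((match = rulePattern.exec(withoutComments)) !== null) {
    var prelude = match[1].trim();
    if (prelude === '' || prelude.charAt(0) === '@') {
      continue; // skip @media, @font-face and similar preludes
    }
    // A comma-separated selector list counts once per selector.
    count += prelude.split(',').length;
  }
  return count;
}

var css = '/* demo */ a, b { color: red; } @media print { .x { display: none; } } p{margin:0}';
console.log(countSelectors(css));
```

Comparing the result against the limit tells you whether trimming is urgent or merely tidy.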
Sample:
jsFiddle Data Scraping HTML: Clone That Webpage
HTML:
<link type="text/css" rel="stylesheet" media="all" href="http://directory.vancouver.wsu.edu/sites/directory.vancouver.wsu.edu/files/css/css_f9f00e4e3fa0bf34a1cb2b226a5d8344.css" />
<div id="facultyAnthropology"></div>
jQuery:
var directoryName = 'anthropology';
$.ajax({
    type: 'GET',
    url: "http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fdirectory.vancouver.wsu.edu%2F"+directoryName+"%22%20and%20xpath%3D%22%2F%2Fdiv%5B%40id%3D'content-area'%5D%22",
    dataType: 'html',
    success: function(data) {
        $('#facultyAnthropology').append($(data).find('results'));
    }
});
Screenshot: as in the image above, the thumbnails in the photo are 100px x 100px.
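One thing to watch with either approach: if the scraped markup contains relative `href` or `src` values, they will resolve against your own page rather than the original site. Here is a minimal sketch of making them absolute, assuming a runtime that provides the standard `URL` constructor. `toAbsoluteUrl` is an illustrative helper and the example path is made up.

```javascript
// Sketch: turn a scraped relative path into an absolute URL.
// toAbsoluteUrl is an illustrative helper; the example path is made up.
function toAbsoluteUrl(path, baseUrl) {
  // The standard URL constructor resolves `path` against `baseUrl`;
  // a path that is already absolute is returned unchanged.
  return new URL(path, baseUrl).href;
}

var base = 'http://directory.vancouver.wsu.edu/anthropology';
console.log(toAbsoluteUrl('/sites/example/photo.jpg', base));
```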