How do I scrape data from a website using the Jaunt library?

Time: 2017-05-31 05:52:47

Tags: java jaunt-api

I want to get the titles from this site: http://feeds.foxnews.com/foxnews/latest

Like this example:

import com.jaunt.JauntException;
import com.jaunt.UserAgent;

public class p8_1
{

    public static void main(String[] args)
    {
        try
        {
            UserAgent userAgent = new UserAgent();
            userAgent.visit("http://feeds.foxnews.com/foxnews/latest"); 
            String title = userAgent.doc.findFirst(
                "<title><![CDATA[SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target]]></title>").getText();
            System.out.println("\n " + title);

        } catch (JauntException e)
        {
            System.err.println(e);
        }

    }

}

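It should display text like this:

"SUCCESSFUL INTERCEPT Pentagon confirms it shot down ICBM-type target" The Pentagon says the U.S. has conducted a successful missile intercept test

This is my code. I am using the Jaunt library.

I don't know why it only displays the text "foxnew.com".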

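1 answer:

Answer 0: (score: 0)

Search for the element type, not its value. `findFirst` takes a tag query such as `<title>`, not the full markup of the element you expect to find.

Try the following to get the title text of each item in the feed: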

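import com.jaunt.Element;
import com.jaunt.Elements;
import com.jaunt.JauntException;
import com.jaunt.UserAgent;

public class p8_1 {
    public static void main(String[] args) {
        try {
            UserAgent userAgent = new UserAgent();
            userAgent.visit("http://feeds.foxnews.com/foxnews/latest");

            // Select every <item> element, then the <title> elements nested inside them.
            Elements items = userAgent.doc.findEach("<item>");
            Elements titles = items.findEach("<title>");

            for (Element title : titles) {
                // The feed wraps each title in CDATA, which is read here via getComment(0).
                String titleText = title.getComment(0).getText();
                System.out.println(titleText);
            }
        } catch (JauntException e) {
            System.err.println(e);
        }
    }
}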
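
For comparison, here is a minimal sketch of the same "select by element type" idea using only the JDK's built-in DOM parser instead of Jaunt. The class name `RssTitles`, the method `extractItemTitles`, and the inline sample feed are assumptions made for illustration; they are not part of Jaunt or of the real Fox News feed. It also suggests one likely reason a plain `<title>` lookup returns the site name: a typical RSS feed has a channel-level `<title>` before any item titles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RssTitles {

    // Returns the text of the first <title> inside each <item>.
    // The channel-level <title> is skipped because it is not inside an <item>.
    public static List<String> extractItemTitles(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                // getTextContent() includes the content of CDATA sections.
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample feed, shaped like a typical RSS document.
        String sampleFeed =
            "<rss><channel>"
            + "<title>Sample channel</title>"
            + "<item><title><![CDATA[First headline]]></title></item>"
            + "<item><title><![CDATA[Second headline]]></title></item>"
            + "</channel></rss>";

        for (String title : extractItemTitles(sampleFeed)) {
            System.out.println(title);
        }
    }
}
```

Run against the sample feed, this prints the two item headlines and leaves out "Sample channel".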