Question

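I'm trying to use Jsoup to fetch stock data from a website called Morningstar. I've looked at other forums but haven't been able to figure out what is wrong.

I'm attempting some more advanced data scraping, but I can't seem to get the price. I either get null back or nothing at all.

I know there are other languages and APIs, but I'd like to use Jsoup because it seems very capable.

Here is what I have so far: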

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Scrape {
    public static void main(String[] args){
        String URL = "http://www.morningstar.com/stocks/xnas/aapl/quote.html";
        Document d = new Document(URL);
        try{
            d = Jsoup.connect(URL).get();
        }catch(IOException e){
            e.printStackTrace();
        }
        Element stuff = d.select("#idPrice gr_text_bigprice").first();
        System.out.println("Price of AAPL: " + stuff);
    }
}

Any help would be greatly appreciated.

Answer 1

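Because the content is generated dynamically with JavaScript, you can use a headless browser such as HtmlUnit: https://sourceforge.net/projects/htmlunit/

The price information is embedded in an iframe, so we first grab the iframe link (which is also built dynamically) and then parse the iframe.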

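import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ScrapeWithHtmlUnit {
    public static void main(String[] args) throws Exception {
        // Silence HtmlUnit's verbose logging.
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        final WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setTimeout(1000);

        // Load the quote page and let HtmlUnit run its scripts.
        HtmlPage page = webClient.getPage("http://www.morningstar.com/stocks/xnas/aapl/quote.html");

        // Hand the rendered markup to Jsoup for selector-based parsing.
        Document doc = Jsoup.parse(page.asXml());
        String title = doc.select(".r_title").select("h1").text();

        // The iframe src is protocol-relative, so prepend the scheme.
        String iFramePath = "http:" + doc.select("#quote_quicktake").select("iframe").attr("src");

        // Load and parse the iframe, which holds the price.
        page = webClient.getPage(iFramePath);
        doc = Jsoup.parse(page.asXml());

        System.out.println(title + " | Last Price [$]: " + doc.select("#last-price-value").text());
    }
}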

Prints:

Apple Inc | Last Price [$]: 98.63
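
The code above builds the iframe URL by prepending "http:" to the src attribute, which only works while the src stays protocol-relative. A slightly more defensive approach, sketched here with the standard java.net.URI class, is to resolve the src against the page URL. The class name IframeUrlResolver and the quotes.example.com addresses are made up for illustration; the real iframe path is generated by the page and is not reproduced here.

```java
import java.net.URI;

// Hypothetical helper, not part of Jsoup or HtmlUnit.
final class IframeUrlResolver {

    // Resolves an iframe src (absolute, protocol-relative, or path-relative)
    // against the URL of the page that contains it.
    static String resolve(String pageUrl, String iframeSrc) {
        return URI.create(pageUrl).resolve(iframeSrc.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.morningstar.com/stocks/xnas/aapl/quote.html";
        // Placeholder src values, assumed for this sketch only.
        System.out.println(resolve(page, "//quotes.example.com/frame.html"));
        System.out.println(resolve(page, "https://quotes.example.com/frame.html"));
        System.out.println(resolve(page, "frame.html"));
    }
}
```

In the answer's code this would replace the string concatenation, e.g. passing the page URL and the value of attr("src"). Jsoup can do a similar resolution itself through absUrl("src"), but only if a base URI was supplied when the document was parsed.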

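The JavaScript engine in HtmlUnit is rather slow (the code above takes about 18 seconds on my machine), so it may be worth looking at other JavaScript engines or headless browsers (PhantomJS etc.; see this list of options: https://github.com/dhamaniasad/HeadlessBrowsers) to improve performance, but HtmlUnit gets the job done. You could also try a custom WebConnectionWrapper to filter out irrelevant scripts, images, etc.:

http://htmlunit.10904.n7.nabble.com/load-parse-speedup-tp22735p22738.html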
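
As a rough illustration of the filtering idea, the decision of which requests to skip can be kept in a small, library-independent predicate. The sketch below is only that decision; the class name ResourceFilter and the extension list are assumptions, and the wiring into HtmlUnit (calling the predicate from a WebConnectionWrapper subclass's overridden getResponse and returning an empty response for skipped URLs) is not shown.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: decides which requests a filtering wrapper could skip.
final class ResourceFilter {

    // Assumed list of extensions that should not affect the quote markup.
    private static final List<String> SKIPPED_EXTENSIONS =
            List.of(".png", ".jpg", ".jpeg", ".gif", ".css", ".woff", ".ico");

    // Returns true when the URL's path ends in one of the skipped extensions.
    // The query string is ignored; a malformed URL makes URI.create throw.
    static boolean isIrrelevant(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return false;
        }
        String lower = path.toLowerCase(Locale.ROOT);
        return SKIPPED_EXTENSIONS.stream().anyMatch(lower::endsWith);
    }

    public static void main(String[] args) {
        System.out.println(isIrrelevant("http://www.example.com/img/logo.PNG?v=2"));
        System.out.println(isIrrelevant("http://www.example.com/js/quote.js"));
    }
}
```

Scripts are deliberately left out of the skip list here, since the page needs some of them to build the iframe link; which scripts can safely be dropped would have to be found by experiment.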
