Question

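I'm using a scraper to capture data from a website. Right now I'm trying to select all <h1> elements and print them (for the time being). I noticed that some headers contain only &nbsp;, which makes the data look empty.

I want to exclude <h1> elements whose value is &nbsp;.

Here is my attempt: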

`private static void getAllH1(String url, Element tCon) {
//      System.out.println("Url: " + url);
        Elements headers1 = tCon.getElementsByTag("h1");
        System.out.println("Url\t\tHeader");
        for(Element h1: headers1) {
            if(h1.text().length()!=0 && h1.text()!="\u00a0") {
                System.out.println(url + "\t\t" + h1.text());
            }
        }
    }`

Edit: I saw in another thread here that jsoup reads &nbsp; as \u00a0, but it still doesn't work.

Here is a sample of the output:

`
Url     Header
http://www.url.com/index.asp        Quick Links
http://www.url.com/index.asp        What's New
http://www.url.com/index.asp         
http://www.url.com/index.asp        What's Next
http://www.url.com/index.asp        What's On
http://www.url.com/index.asp        Key Rates
http://www.url.com/index.asp        Public Advisories

`

Thanks in advance!

Answer 1

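I found the answer through this link:

Element.text() doesn't normalize '&nbsp;' whitespace #529

So I did just that: I updated jsoup from jsoup-1.9.2 to jsoup-1.11.2. Then, when I ran the code (the same code, no changes), it finally recognized &nbsp;.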
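For anyone who cannot upgrade, a check that does not depend on the jsoup version may also help. Note that in the question's code `h1.text() != "\u00a0"` compares object references rather than string contents, so that half of the condition is almost always true and filters nothing. The sketch below is only an illustration: `HeaderFilter` and `isBlankHeader` are hypothetical names, and it assumes the string passed in is whatever `h1.text()` returned.

```java
public class HeaderFilter {

    /**
     * Hypothetical helper: returns true when the text has no visible characters.
     * U+00A0 (the character that &nbsp; decodes to) is checked explicitly because
     * Character.isWhitespace does not count non-breaking spaces as whitespace.
     */
    public static boolean isBlankHeader(String text) {
        if (text == null) {
            return true;
        }
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '\u00a0' && !Character.isWhitespace(c)) {
                return false; // found a visible character
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-ins for the values h1.text() might return.
        String[] samples = {"Quick Links", "\u00a0", "", " \u00a0 ", "What's New"};
        for (String s : samples) {
            if (!isBlankHeader(s)) {
                System.out.println(s);
            }
        }
    }
}
```

In the loop from the question, the condition would then become `if (!HeaderFilter.isBlankHeader(h1.text()))`.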
