Question

I am using Jsoup to download a page and then parse its content.

    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://www.toysrus.ch/product/index.jsp?productId=89689681").get();
        final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
        System.out.println(elements.size());
    }

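Problem: if you view the source of the page, there is a <dt> tag containing the text EAN/ISBN:, but running the code above prints 0 when it should print 1. I checked the HTML with document.html(): the markup is there, but the tag I want appears as escaped characters such as &lt;dt&gt; instead of <dt>. The same code works for other product URLs on the same site.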

I have used Jsoup before and written many parsers with it, but I don't understand why this very simple code fails. Strange! Is this a bug in Jsoup? Can anyone help?

Answer 1

When you use connect() or parse(), jsoup by default expects valid HTML and normalizes the input where necessary. You can try the XML parser instead.

    public static void main(String[] args) throws IOException {
        String url = "http://www.toysrus.ch/product/index.jsp?productId=89689681";
        // Parser.xmlParser() keeps the markup as written instead of applying HTML normalization
        Document document = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
        //final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
        // similar to the above, but matches any element whose own text contains "EAN/ISBN":
        final Elements elements = document.getElementsMatchingOwnText("EAN/ISBN");
        System.out.println(elements.size());
    }
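
To see the difference between the two parsers without hitting the network, here is a minimal offline sketch. It rests on an assumption, not something confirmed for the page in the question: that the <dt> ends up inside an element whose content the HTML parser treats as plain text (a <textarea> is used here). The class name ParserComparison and the sample markup are made up for illustration. On this sample the HTML parser should find no match and the XML parser one match, which would be consistent with the escaped &lt;dt&gt; seen in document.html().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Hypothetical demo class; the sample markup is invented, not taken from the real page.
class ParserComparison {

    // Assumed stand-in for the page: a <dt> nested inside a <textarea>,
    // whose content the HTML parser reads as text rather than as child elements.
    static final String SAMPLE =
        "<html><body><textarea><dl><dt>EAN/ISBN:</dt><dd>123</dd></dl></textarea></body></html>";

    // Same selector as in the question, written as a single string.
    static final String SELECTOR = "dt:contains(EAN/ISBN:)";

    // Default HTML parser: applies HTML parsing rules to the input.
    static int countWithHtmlParser(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(SELECTOR).size();
    }

    // XML parser: builds the tree from the tags as written, with no HTML-specific rules.
    static int countWithXmlParser(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        return doc.select(SELECTOR).size();
    }

    public static void main(String[] args) {
        System.out.println("HTML parser matches: " + countWithHtmlParser(SAMPLE));
        System.out.println("XML parser matches:  " + countWithXmlParser(SAMPLE));
    }
}
```

If the real page behaves differently, printing document.html() around the missing tag should show which enclosing element is swallowing the markup.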

Answer 2

You need to put single quotes around the 'EAN/ISBN:' value; otherwise it will be interpreted as a variable.

Also, there is no need to split the string and concatenate the parts. Put the whole thing in a single string.

Jsoup parser fails only for one specific URL
