Question

I have an HTML file that contains the following:

<div class="title"><a href="../dorothy_perkins_true_blue_suedette/thing?id=130434603" title="Dorothy Perkins True blue suedette clutch bag">Dorothy Perkins True blue suedette clutch bag</a></div>

I want to extract the URL from the href attribute. I have the following code:

            Document doc = Jsoup.connect(url).get();
            Elements products = doc.select("div.title a[href]");
            System.out.println("size: "+products.size());

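However, the output shows a size of 0, so the selector finds no matches. The URL I am using is http://www.polyvore.com/bags/shop?category_id=35. You can check the page source; I am fairly sure the code above is correct. Any ideas would be appreciated. Thanks a lot.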

Answer 1

I assume you are connecting with the following code.

    doc = Jsoup.connect("http://www.polyvore.com/bags/shop?category_id=35").get();

If you run System.out.println(doc.html());, the HTML source it prints is completely different from what browsers such as Mozilla and Chrome show for the same URL.

To fix this, specify the userAgent parameter on the Jsoup connection, as shown below.

    Document doc = null;
    Elements aEles = null;

    try {
        // Without a User-Agent, the default request gets the different markup described above:
        // doc = Jsoup.connect("http://www.polyvore.com/bags/shop?category_id=35").get();

        doc = Jsoup.connect("http://www.polyvore.com/bags/shop?category_id=35")
                .userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
                .referrer("http://www.google.com").get();

        if (doc != null) {
            aEles = doc.select("div.title > a");

            if (aEles != null)
                System.out.println("size: " + aEles.size());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
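
Once the selector matches, note that the href in the question is relative (it starts with ../), so the raw attribute value still has to be resolved against the page URL. Jsoup can do this itself via absUrl("href") on each element. The sketch below only illustrates the resolution step with the standard library, so it runs without Jsoup; the class name HrefResolver and the method resolve are made up for this example and are not part of any library.

```java
import java.net.URI;

public class HrefResolver {

    // Hypothetical helper: resolve a possibly relative href against
    // the URL of the page it was extracted from.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Page URL and href value taken from the question.
        String pageUrl = "http://www.polyvore.com/bags/shop?category_id=35";
        String href = "../dorothy_perkins_true_blue_suedette/thing?id=130434603";

        // Prints the absolute form of the link.
        System.out.println(resolve(pageUrl, href));
    }
}
```

In the answer's loop you would pass each element's attr("href") value as the second argument, or skip the helper and call absUrl("href") directly.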

Selecting a tag inside a div using jsoup
