Question

For each hyperlink, I need to extract the actual URL, the anchor text, and the surrounding paragraph (with all tags stripped).

I can easily extract the link data with jsoup, but I cannot extract the paragraph that contains the hyperlink. I tried the following:

Elements links = doc.select("a[href]");

for (Element link : links) {
    // get the value from the href attribute, resolved to an absolute URL
    System.out.println("\nlink : " + link.attr("abs:href"));
    // anchor text of the link
    System.out.println("text : " + link.text());
    // attempt to get the surrounding paragraph: this does not work
    System.out.println("Surr : " + link.select("p").text());
}

Does anyone know how to achieve this?

Answer 1

If you are interested in links nested inside paragraphs, you can use this selector:

Elements paragraphs = document.select("p:has(a[href])");

Then, as you iterate over these paragraph elements, you can extract the nested a elements like this:

for (Element paragraph : paragraphs) {
    System.out.println(paragraph.select("a[href]"));
}

This way you have access to both the nested a elements and the paragraphs that contain them.
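Putting the selector and the loop together, a minimal sketch could look like the one below. The class name ParagraphLinks, the extract helper, the sample HTML, and the https://example.com/ base URI are all made up for illustration; only the jsoup calls (Jsoup.parse, select, attr, text) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
class ParagraphLinks {

    // Returns one "url | anchor text | paragraph text" row per link inside a <p>.
    static List<String> extract(String html, String baseUri) {
        // The base URI lets "abs:href" resolve relative links.
        Document document = Jsoup.parse(html, baseUri);
        List<String> rows = new ArrayList<>();
        // Only paragraphs that contain at least one link with an href.
        for (Element paragraph : document.select("p:has(a[href])")) {
            for (Element link : paragraph.select("a[href]")) {
                // text() returns the visible text with all tags stripped.
                rows.add(link.attr("abs:href") + " | " + link.text() + " | " + paragraph.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): one link in a paragraph, one outside.
        String html = "<p>Read the <a href=\"/docs\">documentation</a> first.</p>"
                + "<div><a href=\"/about\">About</a></div>";
        extract(html, "https://example.com/").forEach(System.out::println);
    }
}
```

Note that the link inside the div is skipped, because this approach only reports links that sit inside a p element.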

I created a simple gist that you can download and run: https://gist.github.com/wololock/ffd9ef32f7abe3f325b0
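If you would rather keep the question's loop over the links, an alternative is to walk up from each link to its nearest p ancestor. The attempt in the question returns nothing useful because select() only searches the element itself and its descendants, never its ancestors. The sketch below is an assumption-laden illustration, not the answer author's code: EnclosingParagraph, surroundingText, and extract are hypothetical names, and the sample HTML is invented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
class EnclosingParagraph {

    // Returns the text of the nearest <p> ancestor, or "" if the link is not in a paragraph.
    static String surroundingText(Element link) {
        // parents() lists the ancestors starting with the closest one.
        for (Element ancestor : link.parents()) {
            if ("p".equalsIgnoreCase(ancestor.tagName())) {
                return ancestor.text();
            }
        }
        return "";
    }

    // Returns one "url | anchor text | paragraph text" row per link in the document.
    static List<String> extract(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> rows = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            rows.add(link.attr("abs:href") + " | " + link.text() + " | " + surroundingText(link));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample input (an assumption): a nested link in a paragraph and a link outside one.
        String html = "<p>See <b><a href=\"/x\">this</a></b> page.</p>"
                + "<div><a href=\"/y\">bare</a></div>";
        extract(html, "https://example.com/").forEach(System.out::println);
    }
}
```

Unlike the p:has(a[href]) selector, this variant also reports links that are not inside any paragraph, with an empty paragraph field, so you can decide how to treat them.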

Jsoup - extracting the surrounding paragraph for an extracted URL
