Question

I am using jsoup to parse all URL links from a content string, and that works well.

Here is part of the content string containing the URLs. As you can see, the links appear after the text "Download Instructions:", "Mirror:" and "Additional:":

<u>Download Instructions:</u><br/>
<a class="postlink" href="https://test.com/info">https://test.com/info</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://global.eu/navi.html">http://global.eu/navi.html</a>
<br/>Additional:<br/>
<a class="postlink" href="http://main.org/navi.html">http://main.org/navi.html</a>

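My goal now is to parse all URLs (there can be more than one) that follow the text "Download Instructions:" and those that follow the text "Mirror:" separately; the ones after "Additional:" should be ignored.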

The code snippet below simply parses all of them and adds them to an ArrayList (url).

int j = 0;
Document doc = Jsoup.parse(content);
Elements links = doc.select("a.postlink");
for (Element el : links) {
    String urlman = el.attr("abs:href");
    if (urlman != null) {
        url.add(j, urlman);
        j++;
    }
}

It would be great if someone could help.

Thanks in advance.

Answer 1

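Based on the structure you posted, you can walk back through each anchor's previous sibling nodes to find the node that describes it (here either a #text node or a <u> tag), and then do some form of String comparison on that node.

Example code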

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

String source = "<u>Download Instructions:</u><br/><a class=\"postlink\" href=\"https://1test.com/info\">https://test.com/info</a><br/><a class=\"postlink\" href=\"https://2test.com/info\">https://test.com/info</a><br/><a class=\"postlink\" href=\"https://3test.com/info\">https://test.com/info</a><br/>Mirror:<br/><a class=\"postlink\" href=\"http://global.eu/navi1.html\">http://global.eu/navi.html</a><br/><a class=\"postlink\" href=\"http://global.eu/navi2.html\">http://global.eu/navi.html</a><br/>Additional:<br/><a class=\"postlink\" href=\"http://main.org/navi.html\">http://main.org/navi.html</a>";

// The second argument of Jsoup.parse(String, String) is a base URI, not a charset,
// so the single-argument overload is used here.
Document doc = Jsoup.parse(source);

List<String> downloadInstructionsUrls = new ArrayList<>();
List<String> mirrorUrls = new ArrayList<>();

for (Element el : doc.select("a.postlink")) {
    // Walk backwards to the nearest <u> tag or non-blank text node, which holds the label.
    Node label = el.previousSibling();
    while (label != null
            && !(label.nodeName().equals("u")
                 || (label instanceof TextNode && !((TextNode) label).isBlank()))) {
        label = label.previousSibling();
    }
    if (label == null) {
        continue; // no label precedes this link
    }

    String identifier = label.toString();

    if (identifier.contains("Download Instructions")) {
        downloadInstructionsUrls.add(el.attr("abs:href"));
    } else if (identifier.contains("Mirror")) {
        mirrorUrls.add(el.attr("abs:href"));
    }
    // Links under any other label (e.g. "Additional:") are ignored.
}

System.out.println("Url for download instructions:");
downloadInstructionsUrls.forEach(url -> {System.out.println("\t"+url);});
System.out.println("Url for mirror:");
mirrorUrls.forEach(url -> {System.out.println("\t"+url);});

Output

Url for download instructions:
    https://1test.com/info
    https://2test.com/info
    https://3test.com/info
Url for mirror:
    http://global.eu/navi1.html
    http://global.eu/navi2.html
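
The String comparison step could also be pulled out into a small helper, so that label matching tolerates case differences and surrounding markup. The following is only a sketch of that idea using the standard library alone; the class name `LabelClassifier`, the enum `Section` and the method `classify` are hypothetical names introduced for illustration and are not part of jsoup or of the answer's code.

```java
import java.util.Locale;

public class LabelClassifier {
    // Hypothetical categories for the labels seen in the question's HTML.
    enum Section { DOWNLOAD_INSTRUCTIONS, MIRROR, OTHER }

    // Maps a label (plain text or outer HTML such as "<u>Mirror:</u>") to a section.
    static Section classify(String labelText) {
        if (labelText == null) {
            return Section.OTHER;
        }
        String normalized = labelText.toLowerCase(Locale.ROOT);
        if (normalized.contains("download instructions")) {
            return Section.DOWNLOAD_INSTRUCTIONS;
        }
        if (normalized.contains("mirror")) {
            return Section.MIRROR;
        }
        return Section.OTHER; // e.g. "Additional:" is ignored
    }

    public static void main(String[] args) {
        System.out.println(classify("<u>Download Instructions:</u>")); // DOWNLOAD_INSTRUCTIONS
        System.out.println(classify(" mirror: "));                     // MIRROR
        System.out.println(classify("Additional:"));                   // OTHER
    }
}
```

In the loop above, the result of `classify(identifier)` would decide which list, if any, the link's URL is added to.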
