Question

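I have just started exploring Jsoup and ran into the following problem. When I extract links from https://en.wikipedia.org/wiki/Knowledge, an article on the English Wikipedia, everything works fine: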

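    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class WikiLinks {
        public static void main(String[] args) throws Exception {
            Document document = Jsoup.connect("https://en.wikipedia.org/wiki/Knowledge").timeout(6000).get();
            Elements linksOnPage = document.select("a[href^=\"/wiki/\"]");
            for (Element link : linksOnPage) {
                System.out.println("link : " + link.attr("abs:href"));
            }
        }
    }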

However, I also get links that are not part of the article text itself, for example:

    link : https://en.wikipedia.org/wiki/Main_Page
    link : https://en.wikipedia.org/wiki/Portal:Contents
    link : https://en.wikipedia.org/wiki/Portal:Featured_content
    link : https://en.wikipedia.org/wiki/Portal:Current_events
    link : https://en.wikipedia.org/wiki/Special:Random
    link : https://en.wikipedia.org/wiki/Help:Contents
    link : https://en.wikipedia.org/wiki/Wikipedia:About
    link : https://en.wikipedia.org/wiki/Wikipedia:Community_portal

What is the correct way to get, via Jsoup, only the links to other Wikipedia articles that appear in the article text?

Answer 1

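The unwanted links are located inside <div id="mw-panel">.

So the selector would be:

    div:not(#mw-panel) a[href^="/wiki/"]

This selects <a> elements whose href attribute starts with "/wiki/" and that have a <div> ancestor whose ID is not mw-panel.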
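
One thing worth checking before relying on this selector: a descendant combinator only requires that some ancestor matches, so a link nested in an inner <div> of the panel still has a <div> ancestor whose ID is not mw-panel. The sketch below runs the selector against made-up HTML with Jsoup.parse; the markup, the portal class, and the SelectorCheck name are assumptions for illustration, not Wikipedia's exact page structure.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorCheck {
    // Made-up markup: a side panel containing a nested <div>, plus an article body.
    static final String SAMPLE_HTML =
        "<div id=\"mw-panel\"><div class=\"portal\">"
        + "<a href=\"/wiki/Main_Page\">Main page</a></div></div>"
        + "<div id=\"bodyContent\"><a href=\"/wiki/Fact\">Fact</a></div>";

    // Returns how many elements in the sample markup match the given CSS query.
    static int countMatches(String cssQuery) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // The panel link is counted too, because its nearest <div> is the inner one.
        System.out.println(countMatches("div:not(#mw-panel) a[href^=\"/wiki/\"]"));
    }
}
```

If the real panel wraps its links in inner <div> elements like this sample does, the selector would not filter them out, which is a reason to scope the query to the article container instead.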

Edit:

"I only need the links from the article, without links from the side panel and without links such as https://en.wikipedia.org/wiki/Special:BookSources/978-1-4200-5940-3 or https://en.wikipedia.org/wiki/Special:BookSources/1-58450-460-9"

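Then you can try:

    #bodyContent a[href^="/wiki/"]

This selects links that are located in the article (the <div> with ID bodyContent) and whose href attribute starts with "/wiki/".

The div#bodyContent element contains no "/wiki/...Special:..." links. (If you want to exclude links containing some other word, append the following to the end of the selector above, with no space or separator: :not([href*="something"]))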
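
As a rough illustration of appending :not([href*="..."]) to the selector, the sketch below excludes links containing "Special:" in case such links do turn up inside the article body. The sample HTML, the base URL passed to Jsoup.parse, and the ExcludeSpecialLinks name are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExcludeSpecialLinks {
    // Made-up markup standing in for a Wikipedia page.
    static final String SAMPLE_HTML =
        "<div id=\"mw-panel\"><a href=\"/wiki/Main_Page\">Main page</a></div>"
        + "<div id=\"bodyContent\">"
        + "<a href=\"/wiki/Fact\">Fact</a>"
        + "<a href=\"/wiki/Special:BookSources/0\">ISBN</a>"
        + "<a href=\"#cite_note-1\">[1]</a>"
        + "</div>";

    // Collects absolute URLs of body links, skipping hrefs that contain "Special:".
    static List<String> articleLinks(String html) {
        // The base URI lets "abs:href" resolve relative paths.
        Document doc = Jsoup.parse(html, "https://en.wikipedia.org/");
        String query = "#bodyContent a[href^=\"/wiki/\"]:not([href*=\"Special:\"])";
        List<String> result = new ArrayList<>();
        for (Element link : doc.select(query)) {
            result.add(link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String url : articleLinks(SAMPLE_HTML)) {
            System.out.println(url);
        }
    }
}
```

Several :not(...) clauses can be chained in the same way to drop more than one kind of link.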

You can also try combining selectors, based on my attempts above and on the Jsoup selector documentation, to arrive at the pattern that works best.

Sample code:

    String url = "https://en.wikipedia.org/wiki/Knowledge";
    Document document = Jsoup.connect(url).timeout(6000).get();
    // Only links inside the article body whose href starts with "/wiki/"
    Elements links = document.select("#bodyContent a[href^=\"/wiki/\"]");
    for (Element e : links) {
        System.out.println(e.attr("href"));
    }
    System.out.println("Links found: " + links.size());

This prints the following:

    /wiki/Knowledge_(disambiguation)
    /wiki/Fact
    /wiki/Information
    ...
    /wiki/Category:Articles_with_unsourced_statements_from_September_2007
    /wiki/Category:Articles_with_unsourced_statements_from_May_2009
    /wiki/Category:Wikipedia_articles_with_GND_identifiers
    Links found: 826
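
The output above still includes namespace pages such as /wiki/Category:.... If only ordinary articles are wanted, one option is to post-filter the href values in plain Java. This is not part of the original answer: the ArticleLinkFilter class and its methods are hypothetical helpers, and the rule "a colon in the title means a namespace page" is a heuristic that would also drop the few real articles whose titles contain a colon.

```java
import java.util.ArrayList;
import java.util.List;

public class ArticleLinkFilter {
    // Heuristic: treat "/wiki/<title>" as an article only if the title has no colon.
    static boolean isArticleHref(String href) {
        String prefix = "/wiki/";
        if (!href.startsWith(prefix)) {
            return false;
        }
        String title = href.substring(prefix.length());
        return !title.isEmpty() && !title.contains(":");
    }

    // Returns the hrefs that pass isArticleHref, preserving their order.
    static List<String> keepArticles(List<String> hrefs) {
        List<String> kept = new ArrayList<>();
        for (String href : hrefs) {
            if (isArticleHref(href)) {
                kept.add(href);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "/wiki/Fact",
            "/wiki/Information",
            "/wiki/Category:Wikipedia_articles_with_GND_identifiers");
        for (String href : keepArticles(sample)) {
            System.out.println(href);
        }
    }
}
```

The same check could be applied inside the loop over links by testing e.attr("href") before printing.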

jsoup - How to get links from the article text of Wikipedia
