Question

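Consider a URL, www.example.com. It may contain many links, some internal and some external. I want a list of all its sub-links: not the sub-sub-links, only the direct sub-links. E.g. if there are the following four links: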

1)www.example.com/images/main
2)www.example.com/data
3)www.example.com/users
4)www.example.com/admin/data

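then only links 2 and 3 are of use, because they are sub-links and not sub-sub-links or deeper. Is there a way to achieve this with jsoup? If it cannot be done with jsoup, could you suggest some other Java API? Also note that the links must belong to the parent URL that was initially sent (i.e. www.example.com).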

Answer 1

If I understand correctly, a sub-link contains exactly one slash, so you can try counting the slashes, for example:

List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for(String link : list){
    if((link.length() - link.replaceAll("[/]", "").length()) == 1){
        System.out.println(link);
    }
}

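link.length(): the number of characters in the link
link.replaceAll("[/]", "").length(): the number of characters left once the slashes are removed

The difference between the two is the number of slashes. If it equals 1, the link is a direct sub-link; otherwise it is not.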
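The slash count above assumes the links carry no scheme (http://) and no trailing slash. As a minimal sketch of a variant that tolerates both, the path can be parsed with java.net.URI and its non-empty segments counted. The class name DirectSubLinks and its helper methods are hypothetical, and treating scheme-less links as http is an assumption made only so that URI can separate host and path.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DirectSubLinks {

    // Assumption: a link without a scheme is treated as http so URI can split host and path.
    static URI toUri(String link) {
        return URI.create(link.contains("://") ? link : "http://" + link);
    }

    // Number of non-empty path segments, e.g. "/admin/data" has 2 and "/data/" has 1.
    static int pathDepth(String link) {
        String path = toUri(link).getPath();
        if (path == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    // Keeps only links on the parent's host whose path is exactly one segment deep.
    static List<String> directSubLinks(String parent, List<String> links) {
        String parentHost = toUri(parent).getHost();
        List<String> result = new ArrayList<>();
        for (String link : links) {
            String host = toUri(link).getHost();
            if (parentHost != null && parentHost.equalsIgnoreCase(host) && pathDepth(link) == 1) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "www.example.com/images/main",
                "www.example.com/data",
                "www.example.com/users",
                "www.example.com/admin/data");
        // Prints [www.example.com/data, www.example.com/users]
        System.out.println(directSubLinks("www.example.com", links));
    }
}
```

Comparing hosts also drops external links, which the question asks for. Note that URI.create throws IllegalArgumentException for malformed links (for example ones containing spaces), so real input may need a try/catch around it.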

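Edit

How can I scan the whole website for sub-links?

One answer is the robots.txt file (the Robots Exclusion Standard), which lists paths of the website, for example https://stackoverflow.com/robots.txt. Strictly speaking it lists the paths that crawlers are asked to skip or allowed to visit, so it will not necessarily contain every sub-link. The idea is to read this file and extract the sub-links from it. Here is a piece of code that can help you: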

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) throws Exception {
    // Your web site
    String website = "https://stackoverflow.com";
    // We will read the URL https://stackoverflow.com/robots.txt
    URL url = new URL(website + "/robots.txt");
    // List of your sub-links
    List<String> list;
    // Read the file with a BufferedReader
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String subLink;
        list = new ArrayList<>();
        // Loop through the file line by line
        while ((subLink = in.readLine()) != null) {
            // If the line matches this regex, add the sub-link to the list
            if (subLink.matches("Disallow: \\/\\w+\\/")) {
                list.add(website + "/" + subLink.replace("Disallow: /", ""));
            } else {
                System.out.println("not match");
            }
        }
    }
    // Print the result
    System.out.println(list);
}

This will give you:

[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?,
https://stackoverflow.com/search/, https://stackoverflow.com/search?,
https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?,
https://stackoverflow.com/unanswered/, https://stackoverflow.com/unanswered?,
https://stackoverflow.com/u/, https://stackoverflow.com/messages/,
https://stackoverflow.com/ajax/, https://stackoverflow.com/plugins/]

Here is a demo of the regex that I use.

Hope this helps.
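To experiment with the robots.txt idea without hitting the network, the line filtering can be separated from the download. The following is only a sketch under assumptions: the class RobotsTopLevelPaths and the method topLevelPaths are hypothetical names, the regex is a looser variant of the one above (it also accepts Allow lines and paths ending in "?"), and the sample lines are made up rather than copied from any real robots.txt.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RobotsTopLevelPaths {

    // Captures the first path segment of an Allow/Disallow rule, e.g. "posts" in "Disallow: /posts/".
    private static final Pattern RULE =
            Pattern.compile("^(?:Allow|Disallow):\\s*/([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Returns website + "/" + first segment for every matching rule, without duplicates.
    static Set<String> topLevelPaths(String website, List<String> robotsLines) {
        Set<String> result = new LinkedHashSet<>();
        for (String line : robotsLines) {
            Matcher m = RULE.matcher(line.trim());
            if (m.find()) {
                result.add(website + "/" + m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample content; a real file would be read line by line from the site.
        List<String> lines = Arrays.asList(
                "User-agent: *",
                "Disallow: /posts/",
                "Disallow: /posts?",
                "Allow: /",
                "Disallow: /*/ivc/*",
                "Disallow: /search/");
        // Prints [http://example.com/posts, http://example.com/search]
        System.out.println(topLevelPaths("http://example.com", lines));
    }
}
```

Rules that start with a wildcard, or that name only the root, are skipped because they do not identify a concrete top-level path.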

Answer 2

To scan the links on a web page, you can use the jsoup library.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class read_data {

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("**your_url**").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                list.add(link.attr("abs:href"));
            }
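        } catch (IOException ex) {
            // Report the failure instead of silently ignoring it.
            System.err.println("Could not fetch the page: " + ex.getMessage());
        }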
    }
}

The resulting list can then be filtered as suggested in the previous answer.

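The code below reads all the links on a website. I have used http://stackoverflow.com/ for illustration. I would recommend that you go through a company's terms of use before crawling its website.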

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class readAllLinks {

    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {

        readAllLinks obj = new readAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                boolean add = uniqueURL.add(this_url);
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });

        } catch (IOException ex) {
            // Skip pages that cannot be fetched or parsed and carry on with the rest.
        }

    }
}

You will get the list of all the links in the uniqueURL field.
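The recursive get_links above goes one call deeper for every new page, which may exhaust the stack on a large site, and contains(my_site) also matches external URLs that merely mention the site name. A possible alternative, sketched here under assumptions, is a queue-based crawl with a page limit and an exact host comparison. The class QueueCrawler, the maxPages parameter and the linkFetcher function are hypothetical; linkFetcher stands in for whatever returns the absolute links of a page (for instance a wrapper around the jsoup call above) and is faked with an in-memory map so the sketch runs without network access. Unlike the code above, it records only links on the given host.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class QueueCrawler {

    // True when the URL parses and its host equals the expected host exactly.
    static boolean sameHost(String url, String host) {
        try {
            String h = URI.create(url).getHost();
            return h != null && h.equalsIgnoreCase(host);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL: treat as external
        }
    }

    // Breadth-first crawl that fetches at most maxPages pages and returns the same-host URLs seen.
    static Set<String> crawl(String start, String host, int maxPages,
                             Function<String, List<String>> linkFetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        int fetched = 0;
        while (!queue.isEmpty() && fetched < maxPages) {
            String url = queue.poll();
            fetched++;
            for (String link : linkFetcher.apply(url)) {
                // visited.add returns false for URLs already seen, so each page is queued once.
                if (sameHost(link, host) && visited.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Made-up site structure standing in for real pages.
        Map<String, List<String>> fakeSite = new HashMap<>();
        fakeSite.put("http://example.com/",
                Arrays.asList("http://example.com/a", "http://other.org/x"));
        fakeSite.put("http://example.com/a",
                Arrays.asList("http://example.com/", "http://example.com/b"));
        Set<String> found = crawl("http://example.com/", "example.com", 100,
                url -> fakeSite.getOrDefault(url, Collections.emptyList()));
        // Prints [http://example.com/, http://example.com/a, http://example.com/b]
        System.out.println(found);
    }
}
```

The page limit keeps the crawl bounded, which is also kinder to the site being scanned.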

Get the sub-links of a URL using jsoup
