Question

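I have a Java program in which I am writing a method that accepts a URL as a parameter. Is there a way to have the method return a copy of the 'robots.txt' file (e.g. https://www.google.com/robots.txt) associated with the URL I pass in?

Thanks in advance!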

Answer 1

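I know almost nothing about robots.txt, but I seem to remember that it is always stored at the root path of the site. So I believe the getRobot() method in the example below should work for you: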

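import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;

public class Robots {

    public static void main(String[] args) {
        System.out.println(new Robots().getRobot("http://www.google.de/q?Stack Overflow"));
    }

    public String getRobot(String url) {
        // Group 1 captures scheme plus host (and port, if any), e.g. "http://www.google.de"
        Pattern p = Pattern.compile("^(http(s?)://([^/]+))");
        Matcher m = p.matcher(url);
        if (m.find()) {
            // Print the site root that robots.txt will be requested from
            System.out.println(m.group(1));
            try (InputStream in = new URL(m.group(1) + "/robots.txt").openStream()) {
                // Pass the charset explicitly; the one-argument toString(InputStream) is deprecated
                return IOUtils.toString(in, StandardCharsets.UTF_8);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // Fallback when the URL is not http(s) or the download fails
        return "no robots allowed";
    }
}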

See main() for a working example.
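
If you would rather not pull in commons-io or a regex, here is a rough sketch of the same idea using only the JDK. It assumes Java 9 or later (for InputStream.readAllBytes) and that the file is UTF-8 text; the class name RobotsUrl and the methods robotsTxtUrl and fetchRobotsTxt are names I made up for illustration, not part of any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class RobotsUrl {

    // Builds the robots.txt location for the site that serves pageUrl.
    // Throws MalformedURLException if pageUrl has no recognizable protocol.
    static String robotsTxtUrl(String pageUrl) throws MalformedURLException {
        URL page = new URL(pageUrl);
        // getAuthority() is the host plus an explicit port (and user info), if present
        return page.getProtocol() + "://" + page.getAuthority() + "/robots.txt";
    }

    // Downloads robots.txt for the site of pageUrl and returns it as text.
    // Assumption: the file is decoded as UTF-8.
    static String fetchRobotsTxt(String pageUrl) throws IOException {
        try (InputStream in = new URL(robotsTxtUrl(pageUrl)).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Only builds the address; no network access happens here
        System.out.println(robotsTxtUrl("https://www.google.com/search?q=java"));
        // prints https://www.google.com/robots.txt
    }
}
```

Calling fetchRobotsTxt("https://www.google.com/search?q=java") would then try to download the file itself. Unlike getRobot() above, this version lets exceptions propagate, so the caller can tell a missing or unreachable robots.txt apart from a real one.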
