Question

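I am trying to read a robots.txt file using jsoup. I want to read the file line by line and determine whether each line is a Disallow, Allow, User-agent, or Sitemap directive.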

Using jsoup, I do the following:

robotfile = Jsoup.connect(u).get();

robotfile.text();

However, the latter gives me everything on a single line:

80legs User-agent: 008 Disallow: / User-Agent: bender Disallow: /my_sh .. etc

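Even if I use .html(), I don't see any line-break markers (such as tags), so I have nothing that I could replace with plain newline characters.

Is there any way to read this file line by line?

Thanks!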

Answer 1

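jsoup is built for reading and parsing HTML files. A robots.txt file is not HTML, so it is better read with a plain input stream. Here is a simple program that opens a connection and reads Google's robots.txt file.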

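import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RobotsReader {
    public static void main(String[] args) {
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("http://google.com/robots.txt").openStream(), StandardCharsets.UTF_8))) {
            String line;
            // readLine() returns null once the end of the stream is reached
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}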

Output (truncated for length):

User-agent: *
Disallow: /search
Disallow: /sdch
Disallow: /groups
Disallow: /images
Disallow: /catalogs
...
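
Once the lines are available one at a time, the classification the question asks about (Disallow / Allow / User-agent / Sitemap) could be done along the following lines. This is only a minimal sketch: `RobotsLineClassifier`, `Kind`, `classify`, and `value` are hypothetical names invented for illustration, not part of jsoup or the JDK, and the sample lines in `main` are made up rather than taken from any real site. It assumes each directive has the form `field: value`, that field names are matched case-insensitively, and that `#` starts a comment.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of jsoup or the JDK.
class RobotsLineClassifier {

    // The four directive kinds named in the question, plus two catch-alls.
    enum Kind { USER_AGENT, DISALLOW, ALLOW, SITEMAP, OTHER, BLANK }

    // Removes a trailing "#" comment and surrounding whitespace.
    private static String stripComment(String rawLine) {
        int hash = rawLine.indexOf('#');
        return (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
    }

    // Classifies one robots.txt line by the field name before the first colon.
    static Kind classify(String rawLine) {
        String line = stripComment(rawLine);
        if (line.isEmpty()) {
            return Kind.BLANK;
        }
        int colon = line.indexOf(':');
        if (colon < 0) {
            return Kind.OTHER;
        }
        String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
        switch (field) {
            case "user-agent": return Kind.USER_AGENT;
            case "disallow":   return Kind.DISALLOW;
            case "allow":      return Kind.ALLOW;
            case "sitemap":    return Kind.SITEMAP;
            default:           return Kind.OTHER;
        }
    }

    // Returns the text after the first colon, or "" if there is no colon.
    static String value(String rawLine) {
        String line = stripComment(rawLine);
        int colon = line.indexOf(':');
        return colon < 0 ? "" : line.substring(colon + 1).trim();
    }

    public static void main(String[] args) {
        // Made-up sample lines; in practice these would come from readLine().
        String[] sample = {
            "User-agent: *",
            "Disallow: /search",
            "Sitemap: https://example.com/sitemap.xml",
            "# a comment line"
        };
        for (String line : sample) {
            System.out.println(classify(line) + " -> " + value(line));
        }
    }
}
```

In the reading loop above, `System.out.println(line)` could be replaced by a call to `classify(line)`. Splitting on the first colon only keeps URLs such as a sitemap address intact, since their own `:` comes later in the line.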
