Question

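I need to parse an HTML file for a homework project, so I can't use Jsoup.

I have tried crawling through the file, but I don't know how to save the content I'm looking for.

This is what I have: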

    FileInputStream fis = new FileInputStream(filename);
    InputStreamReader inStream = new InputStreamReader(fis);
    BufferedReader reader = new BufferedReader(inStream);

    String fileLine;
    while((fileLine = reader.readLine()) != null){

        String tag = fileLine.substring(fileLine.indexOf("<") + 1, fileLine.indexOf(">"));
    }

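I need to find the information inside the <title> tag, but I don't know how to get it without also picking up tags I don't want, or how to handle the case where the tag is missing.

I want to take the text inside the title tag and turn it into a string I can use.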

Answer 1

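    // Read the whole file into one string, joining the lines with newlines
    String fileDataString = Files.readAllLines(Paths.get(fileName), Charset.forName("UTF-8")).stream().collect(Collectors.joining("\n"));

    // StringUtils comes from Apache Commons Lang; substringBetween returns null if either marker is missing
    String title = StringUtils.substringBetween(fileDataString, "<title>", "</title>");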

This should give you the text between <title> and </title>.

Edit: Thanks to BlackPearl for suggesting Stream<String>.collect(Collectors.joining("\n"));
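
If an extra library such as Apache Commons Lang is also off limits for the homework, the same idea can probably be done with the JDK alone. The following is only a sketch under that assumption: the class name `TitleExtractor` and the method `extractTitle` are made up for illustration, and a regular expression is not a real HTML parser, so it only covers the simple case of a single title element. It matches the tag case-insensitively, tolerates a title that spans several lines, and returns null when there is no title, which covers the "no tag" case from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names here are not from the original post.
final class TitleExtractor {

    // <title ...> followed by the shortest possible text up to </title>.
    // CASE_INSENSITIVE also matches <TITLE>; DOTALL lets '.' cross line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed title text, or null if no complete title element is found.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java TitleExtractor <file.html>");
            return;
        }
        // Read the whole file at once so a title split across lines is still found.
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String title = extractTitle(html);
        System.out.println(title != null ? title : "(no title found)");
    }
}
```

Reading the whole file first, rather than matching line by line as in the question, is what avoids picking up unrelated tags: the pattern only ever looks at the title element.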

How do I parse an HTML file without using Jsoup?