Question

My HTML file is 1,000 lines long, and I want to extract the data written outside the HTML <> tags.

For example:

<>Java Programm<>

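It should read only "Java Programm" and skip anything written inside the "<>" tags.

I tried the following code, but it reads the entire file, including the <> tags, and I don't want "<>" in my output.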

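// Requires: import java.io.FileInputStream; import java.io.IOException;
public static void main(String[] args) {
    // Prints every byte of the file as-is, which is why the <> tags still show up.
    // try-with-resources closes the stream when the block exits.
    try (FileInputStream fin = new FileInputStream("C:\\Users\\File.txt")) {
        int i;
        while ((i = fin.read()) != -1) {
            System.out.print((char) i);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}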

Answer 1

You need an HTML parser. With JSoup it looks like this:

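import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
File input = new File("C:\\Users\\File.txt");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); // throws IOException
Element body = doc.body(); // the <body> element of the parsed HTML
System.out.println(body.text()); // all text inside <body>, with the tags removed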

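This is one way to do it, and it is simple :). There are of course other ways. Note that this version leaves out any text that sits outside the body tag. You can browse the JSoup documentation to find a solution for that.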
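For very simple input, one of those other ways is to stay close to the character-by-character loop from the question and track whether the reader is currently inside a tag. The following is only a minimal sketch under that assumption: the class name `TagStripper` and the method `stripTags` are made up for illustration, and the approach does not handle comments, script or style blocks, entities, or a `>` inside an attribute value. A stray `>` in the text is dropped as well. For real-world HTML a parser such as JSoup remains the safer choice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class TagStripper {

    // Copies every character that lies outside a <...> pair.
    // Everything from '<' up to and including the next '>' is dropped.
    static String stripTags(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                insideTag = true;          // a tag starts here
            } else if (c == '>') {
                insideTag = false;         // the tag ends here
            } else if (!insideTag) {
                out.append((char) c);      // plain text between tags
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample input; the tag names are an assumption.
        String html = "<b>Java Programm</b>";
        System.out.println(stripTags(new StringReader(html))); // prints "Java Programm"
    }
}
```

To run it against a file instead of a string, pass a file-backed reader, for example `Files.newBufferedReader(Paths.get("C:\\Users\\File.txt"))` from `java.nio.file`, in place of the `StringReader`.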
