Question

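I need to remove unclosed tags from an HTML file using Java. Is there a quick way to do this? Is there an API that automatically removes unclosed tags while parsing the file? If not, how can it be done?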

Answer 1

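The idea is to go through the whole file and look for the closing tag of every opening tag. If no closing tag is found, we store the line number of the opening tag so that the line can be removed from the file later.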

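import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Comparator;
import java.util.Stack;

/*
 *  Returns a stack with the line numbers of opening tags that don't have a closing tag.
 *  The smallest line number ends up on top of the stack.
 *  Assumes every tag starts its own line. Void elements written without "/>" (e.g. <br>)
 *  and declarations such as <!DOCTYPE html> are not special-cased.
 */
public static Stack<Integer> removeUnclosedTags(String filePath) {
    //Stores the names of the opening tags that are still waiting for a closing tag
    Stack<String> tags = new Stack<>();
    //Stores the line numbers for those tags
    Stack<Integer> lineNumbers = new Stack<>();
    //Stores the line numbers for tags without a closing one
    Stack<Integer> linesToRemove = new Stack<>();

    try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
        int lineNumber = 0;
        String line;
        //Read exactly one line per iteration so lineNumber never drifts
        while ((line = br.readLine()) != null) {
            lineNumber++;
            line = line.trim();

            //No tag on this line or a tag that gets closed right away (e.g. <br />) - just continue
            if (!line.contains("<") || line.contains("/>")) {
                continue;
            }
            //Check if line starts with a closing tag
            if (line.startsWith("</")) {
                String name = line.substring(2).split("[\\s>]", 2)[0];

                //A closing tag without any matching opening tag is ignored
                if (!tags.contains(name)) {
                    continue;
                }
                //Every tag above the matching one was never closed, so store its line number
                while (!tags.peek().equals(name)) {
                    System.out.println("unclosed tag at line number " + lineNumbers.peek() + ": " + tags.pop());
                    linesToRemove.push(lineNumbers.pop());
                }
                //Remove the matching opening tag
                tags.pop();
                lineNumbers.pop();

                //If it is a starting tag
            } else if (line.startsWith("<")) {
                String name = line.substring(1).split("[\\s>]", 2)[0];
                //Tags opened and closed on the same line (e.g. <p>text</p>) need no tracking
                if (!line.contains("</" + name + ">")) {
                    //Push it to the stack so we can compare it later
                    tags.push(name);
                    lineNumbers.push(lineNumber);
                }
            }
            //Lines that merely contain a tag after some text are skipped
        }
        //Tags still on the stack at the end of the file were never closed
        while (!tags.empty()) {
            System.out.println("unclosed tag at line number " + lineNumbers.peek() + ": " + tags.pop());
            linesToRemove.push(lineNumbers.pop());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    //Sort descending so the smallest line number is on top and can be popped in file order
    linesToRemove.sort(Comparator.reverseOrder());
    return linesToRemove;
}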

This method returns a stack containing the line numbers of the tags that have no closing tag. We can then remove those lines:

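import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

public static void main(String[] args) {

    String filePath = "/some/path/test.html";

    //Copy the result into a set so the lookup does not depend on the order of the stack
    Set<Integer> lines = new HashSet<>(removeUnclosedTags(filePath));

    File inputFile = new File(filePath);
    File tempFile = new File(filePath.replace(".html", "_cleaned.html"));

    //try-with-resources closes both streams even if an exception is thrown
    try (BufferedReader reader = new BufferedReader(new FileReader(inputFile));
         BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))) {

        String currentLine;
        int lineNumber = 0;

        while ((currentLine = reader.readLine()) != null) {
            lineNumber++;

            //Copy every line except the ones that hold an unclosed tag
            if (!lines.contains(lineNumber)) {
                writer.write(currentLine);
                writer.newLine();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }

    //Comment this block out if you want a separate file
    try {
        Files.move(tempFile.toPath(), inputFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
        e.printStackTrace();
    }
}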
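
To see how the stack matching behaves without touching any files, here is a minimal in-memory sketch of the same idea. The class name `UnclosedTagDemo` and the helper `findUnclosed` are hypothetical names made up for this illustration. Like the approach above, it assumes each tag sits on its own line.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

public class UnclosedTagDemo {

    // Hypothetical helper: returns the 1-based line numbers of opening tags
    // that are never closed. Assumes one tag per line, at the start of the line.
    public static List<Integer> findUnclosed(List<String> lines) {
        Deque<String> tags = new ArrayDeque<>();         // open tag names
        Deque<Integer> lineNumbers = new ArrayDeque<>(); // where they were opened
        List<Integer> unclosed = new ArrayList<>();

        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (!line.startsWith("<") || line.contains("/>")) {
                continue; // plain text or a self-closing tag such as <br />
            }
            if (line.startsWith("</")) {
                String name = line.substring(2).split("[\\s>]", 2)[0];
                if (!tags.contains(name)) {
                    continue; // stray closing tag, nothing to match
                }
                // Everything opened after the matching tag was never closed
                while (!tags.peek().equals(name)) {
                    tags.pop();
                    unclosed.add(lineNumbers.pop());
                }
                tags.pop();
                lineNumbers.pop();
            } else {
                tags.push(line.substring(1).split("[\\s>]", 2)[0]);
                lineNumbers.push(i + 1);
            }
        }
        // Tags still open at the end of the input were never closed either
        unclosed.addAll(lineNumbers);
        Collections.sort(unclosed);
        return unclosed;
    }

    public static void main(String[] args) {
        List<String> html = Arrays.asList(
                "<html>",
                "<body>",
                "<div>",
                "<p>",
                "some text",
                "</div>",
                "<span>",
                "</body>",
                "</html>");
        System.out.println(findUnclosed(html)); // prints [4, 7]
    }
}
```

In the sample input, `<p>` on line 4 is still open when `</div>` arrives and `<span>` on line 7 is still open when `</body>` arrives, so those two lines are reported.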
