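I need to remove unclosed tags from an HTML file using Java. Is there a quick way to do this? Is there an API that automatically removes unclosed tags while parsing the file? If not, how can it be done?
Answer 0 (score: 0)
The idea is to process the whole file and look for the closing tag of every opening tag. If no closing tag is found, we save the line number of the opening tag so that the line can be removed from the file later. This works when each tag sits at the start of its own line.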
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Stack;

    /*
     * Returns a stack with the line numbers of opening tags that don't have a closing tag.
     * Assumes at most one element per line, with the tag at the start of the line.
     */
    public static Stack<Integer> removeUnclosedTags(String filePath) {
        // Stores the HTML tags that are still open
        Stack<String> tags = new Stack<>();
        // Stores the line numbers for the open tags (parallel to tags)
        Stack<Integer> lineNumbers = new Stack<>();
        // Stores the line numbers for tags without a closing one
        Stack<Integer> linesToRemove = new Stack<>();
        try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
            int lineNumber = 0;
            String line;
            while ((line = br.readLine()) != null) {
                lineNumber++;
                line = line.trim();
                // No tag at the start of this line, a declaration or comment (<!...),
                // or a tag that gets closed right away (e.g. <br />) - just continue
                if (!line.startsWith("<") || line.startsWith("<!") || line.contains("/>")) {
                    continue;
                }
                if (line.startsWith("</")) {
                    // Closing tag: extract its name, e.g. "div" from "</div>"
                    String name = line.substring(2).split("[\\s>]", 2)[0];
                    // A closing tag that was never opened is ignored
                    if (!tags.contains(name)) {
                        continue;
                    }
                    // Every tag above the matching one on the stack was never closed
                    while (!tags.peek().equals(name)) {
                        System.out.println("unclosed tag at line number " + lineNumbers.peek() + ": " + tags.pop());
                        linesToRemove.push(lineNumbers.pop());
                    }
                    // The top of the stack now matches the closing tag
                    tags.pop();
                    lineNumbers.pop();
                } else {
                    // Opening tag: extract its name, e.g. "div" from "<div class=...>"
                    String name = line.substring(1).split("[\\s>]", 2)[0];
                    // Only track the tag if it is not closed on the same line (e.g. <p>text</p>)
                    if (!line.contains("</" + name + ">")) {
                        tags.push(name);
                        lineNumbers.push(lineNumber);
                    }
                }
            }
            // Tags still on the stack at the end of the file were never closed
            while (!tags.empty()) {
                System.out.println("unclosed tag at line number " + lineNumbers.peek() + ": " + tags.pop());
                linesToRemove.push(lineNumbers.pop());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return linesToRemove;
    }
This method returns a stack containing the line numbers of the tags that have no closing tag. We can then remove those lines:
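    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.StandardCopyOption;
    import java.util.HashSet;
    import java.util.Set;

    public static void main(String[] args) {
        String filePath = "/some/path/test.html";
        // Copy the result into a set: the stack is not ordered by line number,
        // so comparing only against its top element could skip lines
        Set<Integer> lines = new HashSet<>(removeUnclosedTags(filePath));
        File inputFile = new File(filePath);
        File tempFile = new File(filePath.replace(".html", "_cleaned.html"));
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFile));
             BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile))) {
            String currentLine;
            int lineNumber = 0;
            while ((currentLine = reader.readLine()) != null) {
                lineNumber++;
                // Copy every line except the ones that hold an unclosed tag
                if (!lines.contains(lineNumber)) {
                    writer.write(currentLine);
                    writer.newLine();
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }
        try {
            // Comment this line if you want a separate file
            Files.move(tempFile.toPath(), inputFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }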
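The line-based approach above depends on every tag starting its own line. As a rough sketch of the same stack idea without that restriction, the tags can be located with a regular expression over the whole text. Note that this is a variation, not the method from the answer: it deletes only the unclosed opening tag itself rather than the whole line, and it skips void elements such as `<br>`. The class name `UnclosedTagSketch` and the method `stripUnclosedTags` are hypothetical names chosen for illustration. A regex is not a real HTML parser, so tags inside comments, scripts, or attribute values containing `>` are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnclosedTagSketch {
    // Void elements never take a closing tag, so they are not tracked
    private static final Set<String> VOID = Set.of("area", "base", "br", "col", "embed",
            "hr", "img", "input", "link", "meta", "source", "track", "wbr");

    // Matches <name ...>, </name> and <name ... />; <!DOCTYPE ...> and <!-- do not match
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\b[^>]*?(/?)>");

    /** Returns the input with every opening tag that is never closed deleted. */
    static String stripUnclosedTags(String html) {
        Deque<String> names = new ArrayDeque<>();  // names of the tags that are still open
        Deque<int[]> spans = new ArrayDeque<>();   // {start, end} offsets, parallel to names
        List<int[]> toDelete = new ArrayList<>();  // offsets of the unclosed opening tags

        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing || VOID.contains(name)) {
                continue;
            }
            if (!closing) {
                names.push(name);
                spans.push(new int[] {m.start(), m.end()});
            } else if (names.contains(name)) {
                // Everything opened after the matching tag was never closed
                while (!names.peek().equals(name)) {
                    names.pop();
                    toDelete.add(spans.pop());
                }
                names.pop();
                spans.pop();
            }
            // A closing tag that was never opened is left untouched
        }
        // Tags still open at the end of the input were never closed
        toDelete.addAll(spans);

        // Delete from back to front so the earlier offsets stay valid
        toDelete.sort((a, b) -> Integer.compare(b[0], a[0]));
        StringBuilder result = new StringBuilder(html);
        for (int[] span : toDelete) {
            result.delete(span[0], span[1]);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // <p>, <span> and <b> are never closed, so only they are removed
        System.out.println(stripUnclosedTags("<div><p>text<span>more</div><b>bold"));
        // prints "<div>textmore</div>bold"
    }
}
```

For real-world HTML, a proper parser is likely the more robust choice; this sketch only illustrates how the stack matching carries over once tags are no longer tied to line numbers.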