I am trying to use JTidy to convert an HTML file into an XML file. Here is some sample code I found online, written by someone else:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import org.w3c.tidy.Tidy;

public class Html2Xml {
    private String outFileName;
    private String errOutFileName;

    public Html2Xml(String outFileName, String errOutFileName) {
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        BufferedInputStream in;
        FileOutputStream out;
        Tidy tidy = new Tidy();
        // Tell Tidy to convert HTML to XML
        tidy.setXmlOut(true);
        try {
            // Set file for error messages
            tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
            tidy.setForceOutput(true);
            tidy.setInputEncoding("utf-8");
            tidy.setEncloseText(false);
            // u = new URL(url);
            File f = new File("e:/.../gd.htm");
            // input and output
            in = new BufferedInputStream(new FileInputStream(f));
            out = new FileOutputStream(outFileName);
            // Convert files
            tidy.parseDOM(in, out);
            // Clean up
            in.close();
            out.close();
        } catch (IOException e) {
            System.out.println(this.toString() + e.toString());
        }
    }

    public static void main(String[] args) {
        Html2Xml t = new Html2Xml("e:/...../right1.xml", "e:/...../error1.xml");
        t.convert();
    }
}
It works, and I do get the output as an XML file. But when I open the file, it still begins with
<!DOCTYPE HTML> <!DOCTYPE html PUBLIC "" ""> <html>
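That is a problem, because an XML document may contain at most one DOCTYPE declaration, or none at all.
So when I try to use an Eclipse XPath plugin to get the XPath of the anchor elements in the file, it always reports org.xml.sax.SAXParseException: Already seen doctype (I think this is caused by the two DOCTYPE declarations in the XML file).
Thanks in advance for your help!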
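To isolate the problem from the Eclipse plugin, here is a minimal sketch that feeds the same kind of prolog to the JDK's built-in XML parser. The class name DoctypeCheck and the helpers parses and stripDoctypes are hypothetical names made up for this illustration, and the sample string is a shortened stand-in for the real output file. The regex-based stripping is only a crude post-processing workaround under those assumptions, not something JTidy itself provides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DoctypeCheck {

    // Hypothetical helper: returns true if the text parses as well-formed XML.
    static boolean parses(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A second DOCTYPE is a fatal well-formedness error and ends up here.
            return false;
        }
    }

    // Hypothetical helper: removes every DOCTYPE declaration and the whitespace
    // after it. Assumes no internal subset, i.e. no '>' inside the declaration.
    static String stripDoctypes(String xml) {
        return xml.replaceAll("(?i)<!DOCTYPE[^>]*>\\s*", "");
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the start of the generated file.
        String output = "<!DOCTYPE HTML> <!DOCTYPE html PUBLIC \"\" \"\"> "
                + "<html><body><a href=\"#\">link</a></body></html>";

        System.out.println("as generated: " + parses(output));
        System.out.println("DOCTYPEs stripped: " + parses(stripDoctypes(output)));
    }
}
```

If the two DOCTYPE declarations really are the cause, the first check should fail and the second should pass. JTidy may also offer a cleaner way to suppress the declaration at the source (its setDocType option with a value such as "omit" looks relevant), but I have not verified that against this version.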