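I recall that there is an option to configure the XML parser to use TagSoup, but I can't remember the syntax. If possible, I'd like to use JAXB to clean up some nasty HTML.
Trying to unmarshal: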
package my.books;

import java.io.File;
import java.io.FileInputStream;
import java.net.URI;
import java.util.Properties;
import java.util.logging.Logger;
import javax.xml.bind.JAXB;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.XMLReader;

public class App {

    private static final Logger LOG = Logger.getLogger(App.class.getName());
    private Properties properties = new Properties();

    public static void main(String[] args) throws Exception {
        new App().htmlToXhtml();
    }

    private void htmlToXhtml() throws Exception {
        properties.loadFromXML(App.class.getResourceAsStream("/properties.xml"));
        LOG.info(properties.toString());
        URI inputURI = new URI(properties.getProperty("html_input"));
        File htmlInputFile = new File(inputURI);
        FileInputStream fileInputStream = new FileInputStream(htmlInputFile);
        StreamSource streamSource = new StreamSource();
        streamSource.setInputStream(fileInputStream);
        XMLReader xmlReader = new org.ccil.cowan.tagsoup.Parser(); //but it's html, not xml...
        Foo foo = JAXB.unmarshal(streamSource, Foo.class); //foo is ...?
    }
}
This results in an org.xml.sax.SAXParseException and related output:
thufir@dur:~/NetBeansProjects/books$
thufir@dur:~/NetBeansProjects/books$ gradle clean run
> Task :run FAILED
Jan 17, 2019 9:15:47 PM my.books.App htmlToXhtml
INFO: {output=file:/home/thufir/xml/output.xml, basex_path=file:/home/thufir/.basex/, html_input=file:/home/thufir/xml/wget/index.html}
Exception in thread "main" javax.xml.bind.DataBindingException: javax.xml.bind.UnmarshalException
- with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 665; columnNumber: 191; The element type "img" must be terminated by the matching end-tag "</img>".]
at javax.xml.bind.JAXB.unmarshal(JAXB.java:262)
at my.books.App.htmlToXhtml(App.java:33)
at my.books.App.main(App.java:18)
Caused by: javax.xml.bind.UnmarshalException
Is JAXB unable to do this because the input is arbitrary HTML rather than the XML it expects?
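For what it's worth, in the code above the TagSoup `xmlReader` is created but never handed to anything, so the `StreamSource` is still parsed by the default strict XML parser. A minimal sketch of one way to wire a specific `XMLReader` into the pipeline is to wrap it in a `SAXSource` and run an identity transform over it. The class name `ReaderToXml` and the helper `toXml` are hypothetical, and to keep the sketch runnable with only the JDK it uses the JDK's default reader as a stand-in; the assumption is that `new org.ccil.cowan.tagsoup.Parser()` could be passed instead when TagSoup is on the classpath. As far as I know, the same `SAXSource` can also be given to `JAXB.unmarshal(Source, Class)`, but that is untested here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class ReaderToXml {

    // Parses the stream with the supplied XMLReader and serializes the
    // resulting SAX events back to a string via an identity transform.
    static String toXml(XMLReader reader, InputStream in) throws Exception {
        // The SAXSource is what ties the chosen reader to the input.
        SAXSource source = new SAXSource(reader, new InputSource(in));

        // A Transformer created without a stylesheet copies input to output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader from the JDK. Assumption: with TagSoup available,
        // this could be replaced by: new org.ccil.cowan.tagsoup.Parser()
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        byte[] input = "<p>hello<br/></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(toXml(reader, new ByteArrayInputStream(input)));
    }
}
```

With the stand-in reader the input has to be well-formed already; the point of swapping in a lenient reader such as TagSoup would be that the events reaching the transformer (or JAXB) are well-formed even when the HTML is not.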
I had forgotten about the s9api identity transform:
https://stackoverflow.com/a/6787473/262852
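Actually, it looks like it's possible with reflection. Frankly, I'm surprised this isn't a library. Or, if it isn't a library, then perhaps I'm going my own way, and not a good way. (Clearly at least one other person has had the same problem as me.)
The main useful code snippet I want to offer is unmarshalling XML data through JAXB with reflection. The reason I want to do this is that I may not always know the specific XML object I am going to deserialize. Also, because I'm lazy, I don't care or want to know what the internal details of the XML document are :).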