SAXParseException when using JAXB with HTML

Asked: 2019-01-18 05:21:27

Tags: html jaxb xhtml xjc tag-soup

As I recall, there is an option to configure the XML parser to use TagSoup, but I can't recall the syntax. I'm using JAXB to clean up some nasty HTML, if that's possible.

Attempting to unmarshal:

package my.books;

import java.io.File;
import java.io.FileInputStream;
import java.net.URI;
import java.util.Properties;
import java.util.logging.Logger;
import javax.xml.bind.JAXB;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.XMLReader;

public class App {

    private static final Logger LOG = Logger.getLogger(App.class.getName());
    private Properties properties = new Properties();

    public static void main(String[] args) throws Exception {
        new App().htmlToXhtml();
    }

    private void htmlToXhtml() throws Exception {
        properties.loadFromXML(App.class.getResourceAsStream("/properties.xml"));
        LOG.info(properties.toString());
        URI inputURI = new URI(properties.getProperty("html_input"));
        File htmlInputFile = new File(inputURI);

        FileInputStream fileInputStream = new FileInputStream(htmlInputFile);
        StreamSource streamSource = new StreamSource();
        streamSource.setInputStream(fileInputStream);

        XMLReader xmlReader = new org.ccil.cowan.tagsoup.Parser();  //but it's html, not xml...

        Foo foo = JAXB.unmarshal(streamSource, Foo.class);  //foo is ...?
    }

}

The org.xml.sax.SAXParseException and the surrounding output:

thufir@dur:~/NetBeansProjects/books$ 
thufir@dur:~/NetBeansProjects/books$ gradle clean run

> Task :run FAILED
Jan 17, 2019 9:15:47 PM my.books.App htmlToXhtml
INFO: {output=file:/home/thufir/xml/output.xml, basex_path=file:/home/thufir/.basex/, html_input=file:/home/thufir/xml/wget/index.html}
Exception in thread "main" javax.xml.bind.DataBindingException: javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 665; columnNumber: 191; The element type "img" must be terminated by the matching end-tag "</img>".]
        at javax.xml.bind.JAXB.unmarshal(JAXB.java:262)
        at my.books.App.htmlToXhtml(App.java:33)
        at my.books.App.main(App.java:18)
Caused by: javax.xml.bind.UnmarshalException

JAXB cannot do this, because the input is arbitrary HTML rather than the XML it expects.
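The stack trace suggests the `XMLReader` created above is never handed to JAXB, so the default XML parser reads the HTML and rejects the unclosed `img`. One way to wire in a specific parser is to wrap it in a `SAXSource`. The sketch below is only an illustration: the class name `HtmlToXhtmlSketch` and the helper `copy` are made up, and it uses the JDK's own SAX parser on a small well-formed sample as a stand-in so it runs without extra jars. With TagSoup on the classpath, the assumption is that `new org.ccil.cowan.tagsoup.Parser()` can be passed in instead. The copying is done with a plain JAXP identity transform, not s9api.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class HtmlToXhtmlSketch {

    // Hypothetical helper: serializes whatever the given reader parses, unchanged.
    static String copy(XMLReader reader, InputSource input) throws Exception {
        // newTransformer() without a stylesheet gives the identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        // SAXSource pairs the input with the parser that should read it.
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK's namespace-aware SAX parser.
        // Assumption: for real HTML, replace it with
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String sample = "<p>cover <img src=\"a.png\"/></p>";
        System.out.println(copy(reader, new InputSource(new StringReader(sample))));
    }
}
```

`JAXB.unmarshal` also has an overload that takes a `Source`, so the same `SAXSource` could probably be passed there in place of the `StreamSource`. TagSoup reports elements in the XHTML namespace by default, so `Foo` would likely need to be mapped to that namespace for the binding to match.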

I had forgotten about the s9api identity:

https://stackoverflow.com/a/6787473/262852


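Actually, it looks like it's possible with reflection. Frankly, I'm surprised this isn't a library. Or, if it isn't a library, then perhaps I'm off on my own path and out of step. (Clearly at least one other person has had the same problem I have.)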

  

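The main useful code snippet I have to offer is for unmarshalling XML data through JAXB with reflection. The reason I want to do this is that I may not always know the specific XML object I will be deserializing. Also, because I'm lazy, I don't care or want to know what the internal details of the XML document are :).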

0 answers:

No answers yet