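The specific error is Exception in thread "main" java.net.MalformedURLException: no protocol. Because the html is printed to the console, the URL appears to be perfectly valid, so the error message may not be very helpful.
Sticking with Saxon-HE and tagsoup, should I be validating the streamResult first?
Reading the console output, it looks almost as if the html has been wrapped in xml, so making a Document from the streamResult should be sufficient.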
The crash:
thufir@dur:~/NetBeansProjects/helloWorldSaxon$ gradle clean run
> Task :run
Exception in thread "main" java.net.MalformedURLException: no protocol: <?xml version="1.0" encoding="UTF-8"?><!--[if lt IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--><!--[if IE 7]> <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]--><!--[if IE 8]> <html lang="en-us" class="no-js lt-ie9"> <![endif]--><!--[if gt IE 8]><!--><html xmlns:html="http://www.w3.org/1999/xhtml" class="no-js" lang="en-us"><!--<![endif]-->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>
All products | Books to Scrape - Sandbox
</title>
<meta name="created" content="24th Jun 2016 09:29" />
<meta name="description" content="" />
<meta name="viewport" content="width=device-width" />
<meta name="robots" content="NOARCHIVE,NOCACHE" />
<!-- Le HTML5 shim, for IE6-8 support of HTML elements --><!--[if lt IE 9]>
<script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="shortcut icon" href="static/oscar/favicon.ico" />
<link rel="stylesheet" type="text/css" href="static/oscar/css/styles.css" />
<link rel="stylesheet" href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" />
<link rel="stylesheet" type="text/css" href="static/oscar/css/datetimepicker.css" />
</head>
..
<!-- Version: N/A -->
</body>
</html>
at java.net.URL.<init>(URL.java:593)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:620)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:148)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:806)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
at helloWorldSaxon.HandlerForXML.parseFromURL(HandlerForXML.java:53)
at helloWorldSaxon.App.scrapeHTML(App.java:26)
at helloWorldSaxon.App.main(App.java:19)
> Task :run FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':run'.
> Process 'command '/usr/lib/jvm/java-8-openjdk-amd64/bin/java'' finished with non-zero exit value 1
* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
* Get more help at https://help.gradle.org
BUILD FAILED in 3s
4 actionable tasks: 4 executed
thufir@dur:~/NetBeansProjects/helloWorldSaxon$
Notably, there is no closing xml tag.
The code:
    public void parseFromURL() throws SAXException, ParserConfigurationException, IOException, TransformerException {
        StringWriter writer = new StringWriter();
        StreamResult streamResult = new StreamResult(writer);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        XMLReader xmlReader = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
        Source source = new SAXSource(xmlReader, new InputSource(url.toString()));
        Transformer transformer = transformerFactory.newTransformer();
        transformer.transform(source, streamResult);
        String stringResult = writer.toString();
        LOG.fine(stringResult);
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = documentBuilderFactory.newDocumentBuilder();
        Document document;
        document = builder.parse(stringResult);
    }
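A note on the stack trace: the exception is raised from DocumentBuilder.parse(DocumentBuilder.java:177), and DocumentBuilder.parse(String) treats its argument as a URI to fetch, not as XML text. That would explain why the whole document shows up after "no protocol:". The following is a minimal sketch of one possible workaround, wrapping the text in an InputSource backed by a StringReader. The class name ParseXmlString, the helper parseXml, and the sample XML are made up for illustration and are not part of the project above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseXmlString {

    // Parses XML that is already held in memory as a String.
    // builder.parse(xml) would interpret the String as a URI, so the text
    // is wrapped in an InputSource that reads from the String instead.
    static Document parseXml(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in for the stringResult produced by the transform.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<html><head><title>t</title></head><body/></html>";
        Document document = parseXml(xml);
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```

In parseFromURL this would presumably amount to changing only the last statement so that stringResult is passed through an InputSource. Whether the tagsoup output then parses cleanly is a separate question that this sketch does not answer.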
I'm looking to build a Document from stringResult, which appears to be well-formed xml.
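If the goal is only to end up with a Document, it may be possible to skip the String round trip altogether by transforming into a DOMResult instead of a StreamResult. This is a rough sketch under assumptions: it uses the JDK's built-in identity transformer, the class name DomResultSketch and the helper toDocument are invented for illustration, and a StreamSource over a small literal stands in for the tagsoup-backed SAXSource so that the example runs without extra jars.

```java
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

public class DomResultSketch {

    // Runs the identity transform straight into a DOM tree.
    // Any Source can be passed in, including a SAXSource.
    static Document toDocument(Source source) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        DOMResult domResult = new DOMResult();
        transformer.transform(source, domResult);
        // With an empty DOMResult the transformer creates the Document node.
        return (Document) domResult.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in input; in the question this would be the SAXSource built on tagsoup.
        Source source = new StreamSource(
                new StringReader("<html><body><p>hi</p></body></html>"));
        Document document = toDocument(source);
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```

The String version can still be produced separately for logging if needed; the point of the sketch is only that the Document does not have to be rebuilt from text.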