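I have a list of web pages that I want to read data from. The problem is that these pages are built with Node.js and rely on JavaScript, so I am forced to load them into a JavaFX WebView (because WebView has a JavaScript engine) and get each page's DOM from the WebView. I am using TransformerFactory to serialize the DOM to a string.
This code runs for a while, but after visiting a few pages it suddenly stops. The log looks like this: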
#
# A fatal error has been detected by the Java Runtime Environment:
#
# EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x000000005a1597cb, pid=10044, tid=0x00000000000028b4
#
# JRE version: Java(TM) SE Runtime Environment (8.0_172-b11) (build 1.8.0_172-b11)
The full log is here: hs_err_pid10044.log
Here is the code snippet that loads the page (sorry, it is in Kotlin):
class ChildPageCrawl(url: String, path: String, semaphore: Semaphore) : JFrame() {
    init {
        title = "Web View"
        setSize(800, 600)
        val panel = JFXPanel()
        this.add(panel)
        Platform.runLater {
            val browser = WebView()
            val webEngine = browser.engine
            panel.scene = Scene(browser, 700.0, 500.0)
            webEngine.load(url)
            webEngine.loadWorker.stateProperty().addListener { _, _, newState ->
                if (newState == Worker.State.SUCCEEDED) {
                    Thread.sleep(10000) // Waiting for the page to fully load
                    semaphore.acquire()
                    val doc = webEngine.document
                    val transformer = TransformerFactory.newInstance().newTransformer()
                    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no")
                    transformer.setOutputProperty(OutputKeys.METHOD, "xml")
                    transformer.setOutputProperty(OutputKeys.INDENT, "yes")
                    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8")
                    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4")
                    val source = DOMSource(doc)
                    val writer = StringWriter()
                    val result = StreamResult(writer)
                    transformer.transform(source, result)
                    val str = writer.toString()
                    analyzeSML(str)
                    semaphore.release()
                }
            }
        }
        defaultCloseOperation = DISPOSE_ON_CLOSE
    }
}
I don't know where I am going wrong. WebView and its WebEngine also seem very slow, because they render the pages, which I don't need.
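To help narrow this down, below is a minimal sketch of just the serialization step, run on a small in-memory document instead of a WebView page. The helper names `domToString` and `buildSampleDocument` are hypothetical and exist only for this illustration; the output properties are the same ones used in the snippet above. If this runs cleanly on its own, the crash may be tied to how the WebView document is accessed rather than to the Transformer settings, though that is an assumption and not something the log confirms.

```kotlin
import java.io.StringWriter
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult
import org.w3c.dom.Document

// Hypothetical helper: serializes a W3C DOM document to an indented XML string,
// using the same output properties as the crawler snippet.
fun domToString(doc: Document): String {
    val transformer = TransformerFactory.newInstance().newTransformer()
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no")
    transformer.setOutputProperty(OutputKeys.METHOD, "xml")
    transformer.setOutputProperty(OutputKeys.INDENT, "yes")
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8")
    // Namespaced, implementation-specific property (honored by Xalan-based transformers).
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4")
    val writer = StringWriter()
    transformer.transform(DOMSource(doc), StreamResult(writer))
    return writer.toString()
}

// Hypothetical stand-in for webEngine.document: a tiny document built in memory.
fun buildSampleDocument(): Document {
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
    val html = doc.createElement("html")
    val body = doc.createElement("body")
    body.textContent = "hello"
    html.appendChild(body)
    doc.appendChild(html)
    return doc
}

fun main() {
    // Prints the serialized document; no JavaFX classes are involved here.
    println(domToString(buildSampleDocument()))
}
```

In the crawler, `domToString(webEngine.document)` would take the place of the inline Transformer calls, which keeps the listener body short and makes it easier to see which step runs on the JavaFX Application Thread.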