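I am trying to crawl an AJAX-based site with Nutch, using the protocol-selenium plugin with the phantomjs driver. I am using apache-nutch-1.13, compiled from Nutch's GitHub repository. The crawls are launched as tasks in a system managed by Mesos. When I launch Nutch's crawl script from a terminal on the server, everything works perfectly and the site is crawled as requested. However, when I run the same crawl script with the same arguments inside a Mesos task, Nutch throws an exception: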
fetch of http://XXXXX failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: {"errorMessage":"Unable to find element with tag name 'body'","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"35","Content-Type":"application/json; charset=utf-8","Host":"localhost:12215","User-Agent":"Apache-HttpClient/4.3.5 (java 1.5)"},"httpVersion":"1.1","method":"POST","post":"{\"using\":\"tag name\",\"value\":\"body\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/a7f98ec0-b8aa-11e6-8b84-232b0d8e1024/element"}}
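My first impression was that something was off with the environment variables (HADOOP_HOME, PATH, CLASSPATH, ...), but I set the same variables in the nutch script and in the terminal, and the result is still the same.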
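To rule out the environment more thoroughly, the full output of `env` could be captured once in the terminal and once inside the Mesos task, and the two dumps compared. Below is a minimal sketch of such a comparison, not something I have confirmed points at the cause. The function names `parse_env_dump` and `diff_env` are hypothetical, and the parser assumes one `KEY=VALUE` pair per line, so multi-line values would be misread.

```python
def parse_env_dump(text):
    """Parse the output of `env` (one KEY=VALUE per line) into a dict.

    Assumption: values do not contain newlines. Lines without '=' are skipped.
    """
    env = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def diff_env(terminal_env, task_env):
    """Return {name: (terminal_value, task_value)} for every variable that
    is missing on one side or has a different value. None means 'not set'."""
    diff = {}
    for key in sorted(set(terminal_env) | set(task_env)):
        terminal_value = terminal_env.get(key)
        task_value = task_env.get(key)
        if terminal_value != task_value:
            diff[key] = (terminal_value, task_value)
    return diff
```

Each dump would be read from a file and passed through `parse_env_dump`, then both dicts handed to `diff_env`. Any variable that shows up in the result is set differently in the two contexts, including ones outside the HADOOP_HOME, PATH and CLASSPATH set I checked by hand.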
Any ideas about what I am doing wrong?