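I am using Nutch 1.9.
Most of the pages I crawl generate their content with JavaScript, and Nutch ignores JavaScript-generated content. Is there a way to fetch it?
I found that Selenium might be one approach, but it seems to be supported only by Nutch 2.x. Can it be integrated with Nutch 1.9, and if so, how?
I followed the installation instructions for nutch-selenium, but when I run ant, the build fails with many errors: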
compile:
[echo] Compiling plugin: protocol-selenium
[javac] Compiling 2 source files to $NUTCH_HOME/build/protocol-selenium/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:14: error: package org.apache.nutch.storage does not exist
[javac] import org.apache.nutch.storage.WebPage;
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:15: error: package org.apache.nutch.storage.WebPage does not exist
[javac] import org.apache.nutch.storage.WebPage.Field;
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:26: error: package WebPage does not exist
[javac] private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:49: error: cannot find symbol
[javac] protected Response getResponse(URL url, WebPage page, boolean redirect)
[javac] ^
[javac] symbol: class WebPage
[javac] location: class Http
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:55: error: package WebPage does not exist
[javac] public Collection<WebPage.Field> getFields() {
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java:16: error: package org.apache.nutch.storage does not exist
[javac] import org.apache.nutch.storage.WebPage;
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java:47: error: cannot find symbol
[javac] public HttpResponse(Http http, URL url, WebPage page, Configuration conf) throws ProtocolException, IOException {
[javac] ^
[javac] symbol: class WebPage
[javac] location: class HttpResponse
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:26: error: package WebPage does not exist
[javac] private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:29: error: package WebPage does not exist
[javac] FIELDS.add(WebPage.Field.MODIFIED_TIME);
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:30: error: package WebPage does not exist
[javac] FIELDS.add(WebPage.Field.HEADERS);
[javac] ^
[javac] $NUTCH_HOME/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java:54: error: method does not override or implement a method from a supertype
[javac] @Override
[javac] ^
[javac] 11 errors
[javac] 1 warning
BUILD FAILED
$NUTCH_HOME/build.xml:112: The following error occurred while executing this line:
$NUTCH_HOME/src/plugin/build.xml:77: The following error occurred while executing this line:
$NUTCH_HOME/src/plugin/build-plugin.xml:133: Compile failed; see the compiler error output for details.
Or is there another option?