NoClassDefFoundError when implementing HtmlMapper in Tika for Nutch

Time: 2020-07-01 03:55:41

Tags: java nutch apache-tika

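I am using Tika in Nutch to parse crawled HTML pages, but I want to discard some HTML elements. It looks like I can do this by implementing the HtmlMapper interface and overriding the DISCARDABLE_ELEMENTS setting.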
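To make the idea concrete, here is a minimal sketch of such a mapper. It assumes that Tika's HtmlMapper interface has the three methods shown (as I understand the Tika 1.x API) and declares a local stand-in for it so the sketch compiles without Tika on the classpath. The class names and the choice of discarded elements are hypothetical.

```java
import java.util.Locale;
import java.util.Set;

public class DiscardingMapperSketch {

    // Local stand-in for org.apache.tika.parser.html.HtmlMapper (assumed
    // shape), declared here only so the sketch compiles on its own.
    public interface HtmlMapper {
        String mapSafeElement(String name);
        boolean isDiscardElement(String name);
        String mapSafeAttribute(String elementName, String attributeName);
    }

    public static class DiscardingMapper implements HtmlMapper {

        // Hypothetical set of elements to drop, stored in upper case.
        private static final Set<String> DISCARDABLE_ELEMENTS =
                Set.of("STYLE", "SCRIPT", "NAV", "FOOTER");

        @Override
        public String mapSafeElement(String name) {
            // Pass every element through under its lower-case name.
            return name.toLowerCase(Locale.ENGLISH);
        }

        @Override
        public boolean isDiscardElement(String name) {
            // Normalise the case so "nav" and "NAV" are treated alike.
            return DISCARDABLE_ELEMENTS.contains(name.toUpperCase(Locale.ENGLISH));
        }

        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            // Pass every attribute through under its lower-case name.
            return attributeName.toLowerCase(Locale.ENGLISH);
        }
    }

    public static void main(String[] args) {
        HtmlMapper mapper = new DiscardingMapper();
        System.out.println("discard nav? " + mapper.isDiscardElement("nav"));
        System.out.println("discard p?   " + mapper.isDiscardElement("p"));
    }
}
```

In a real plugin the class would implement Tika's own interface instead of the stand-in.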

This may be down to my inexperience with Java, but when I do this with my own HtmlMapper implementation (a slightly modified DefaultHtmlMapper.java), Nutch fails with the following error when it calls Tika:

2020-07-01 13:13:27,009 WARN  mapred.LocalJobRunner - job_local1201182862_0001
 java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/tika/parser/html/HtmlMapper
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:551)
Caused by: java.lang.NoClassDefFoundError: org/apache/tika/parser/html/HtmlMapper
        at java.base/java.lang.ClassLoader.defineClass1(Native Method)
        at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
        at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
        at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:821)
        at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:719)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:642)
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:600)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
        at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:90)
        at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:72)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:333)
        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:289)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:169)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:134)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75)
        at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:126)
        at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:77)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.html.HtmlMapper
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)

The class itself begins as follows. I activate it through the tika.htmlmapper.classname setting in nutch-site.xml:

package org.apache.tika.parser.html;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

@SuppressWarnings("serial")
public class CustomHtmlMapper implements HtmlMapper {

    public static final HtmlMapper INSTANCE = new CustomHtmlMapper();

    // Based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    private static final Map<String, String> SAFE_ELEMENTS = new HashMap<String, String>() {{
        put("H1", "h1");
        put("H2", "h2");
        ...

This has had me stumped for a while, so I would really appreciate any hints!

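Edit: I found that I can get this to work by adding the CustomHtmlMapper class to the org.apache.tika.parser.html package, building the tika-parsers jar with my class included, replacing the tika-parsers jar in Nutch with that build, and then changing the tika.htmlmapper.classname setting in nutch-site.xml to org.apache.tika.parser.html.CustomHtmlMapper.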

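However, it would be much more convenient if I could put CustomHtmlMapper in my own package and have Nutch/Tika reference it. The JVM seems to fail when it sees the "implements org.apache.tika.parser.html.HtmlMapper" reference in the CustomHtmlMapper class (see the stack trace above). I would appreciate any advice on how to achieve this.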
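One reading of the stack trace above is that CustomHtmlMapper was found by the application class loader (reached through PluginClassLoader.loadClassFromParent), and that this loader could not see HtmlMapper, presumably because the tika-parsers jar is only on the plugin's class path. That is an interpretation, not a confirmed diagnosis. A small diagnostic like the sketch below could help check which loader in a chain can see a given class. The ClassVisibility class and its methods are hypothetical helpers, not part of Nutch or Tika.

```java
public class ClassVisibility {

    /** Returns true if the given loader (or one of its parents) can load the class. */
    public static boolean isVisible(String className, ClassLoader loader) {
        try {
            // initialize=false: load the class without running its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Walks from the given loader up through its parents and reports visibility at each level. */
    public static String report(String className, ClassLoader start) {
        StringBuilder sb = new StringBuilder();
        for (ClassLoader cl = start; cl != null; cl = cl.getParent()) {
            sb.append(cl.getClass().getName())
              .append(": ")
              .append(isVisible(className, cl) ? "visible" : "not visible")
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Defaults to the interface named in the stack trace.
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.html.HtmlMapper";
        System.out.print(report(name, Thread.currentThread().getContextClassLoader()));
    }
}
```

Calling report from inside the plugin, with the plugin's own class loader as the starting point, would show whether HtmlMapper is visible only at the plugin level and not to the loader that defined CustomHtmlMapper.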

0 answers:

No answers