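I am using Tika in Nutch to parse crawled HTML pages, but I want to discard some HTML elements. It looks like I can do this by implementing the HtmlMapper interface and overriding the DISCARDABLE_ELEMENTS setting.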
This may be down to my lack of Java experience, but when I do this, Nutch fails with the following error when it calls Tika:
2020-07-01 13:13:27,009 WARN mapred.LocalJobRunner - job_local1201182862_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/tika/parser/html/HtmlMapper
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:551)
Caused by: java.lang.NoClassDefFoundError: org/apache/tika/parser/html/HtmlMapper
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:151)
    at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:821)
    at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:719)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:642)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:600)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
    at org.apache.nutch.plugin.PluginClassLoader.loadClassFromParent(PluginClassLoader.java:90)
    at org.apache.nutch.plugin.PluginClassLoader.loadClass(PluginClassLoader.java:72)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:333)
    at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:289)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:169)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:134)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:75)
    at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:126)
    at org.apache.nutch.parse.ParseSegment$ParseSegmentMapper.map(ParseSegment.java:77)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.parser.html.HtmlMapper
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
My class implementing HtmlMapper is a slightly modified copy of DefaultHtmlMapper.java, as shown below:
package org.apache.tika.parser.html;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
@SuppressWarnings("serial")
public class CustomHtmlMapper implements HtmlMapper {
    public static final HtmlMapper INSTANCE = new CustomHtmlMapper();
    // Based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
    private static final Map<String, String> SAFE_ELEMENTS = new HashMap<String, String>() {{
        put("H1", "h1");
        put("H2", "h2");
        ...
I then activate it in nutch-site.xml through the tika.htmlmapper.classname setting.
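For anyone unfamiliar with the interface, here is a minimal, self-contained sketch of the kind of mapper I mean. It is not my actual class: `HtmlMapperStandIn` is a hypothetical local copy of the three HtmlMapper methods so that the snippet compiles without Tika on the classpath, and the extra discarded elements (NAV, FOOTER) are only example choices.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for org.apache.tika.parser.html.HtmlMapper, declared
// locally so this sketch compiles without Tika. A real mapper would implement
// Tika's own interface instead.
interface HtmlMapperStandIn {
    String mapSafeElement(String name);
    boolean isDiscardElement(String name);
    String mapSafeAttribute(String elementName, String attributeName);
}

class DiscardingHtmlMapperSketch implements HtmlMapperStandIn {
    // Elements that are passed through, keyed by upper-case name.
    private static final Map<String, String> SAFE_ELEMENTS = new HashMap<>();
    // Elements whose entire content is dropped, keyed by upper-case name.
    private static final Set<String> DISCARDABLE_ELEMENTS = new HashSet<>();

    static {
        SAFE_ELEMENTS.put("H1", "h1");
        SAFE_ELEMENTS.put("H2", "h2");
        SAFE_ELEMENTS.put("P", "p");
        DISCARDABLE_ELEMENTS.add("STYLE");
        DISCARDABLE_ELEMENTS.add("SCRIPT");
        // Assumed extras for illustration only: page chrome to throw away.
        DISCARDABLE_ELEMENTS.add("NAV");
        DISCARDABLE_ELEMENTS.add("FOOTER");
    }

    static final DiscardingHtmlMapperSketch INSTANCE = new DiscardingHtmlMapperSketch();

    @Override
    public String mapSafeElement(String name) {
        // Returns the normalised element name, or null if the element is not safe.
        return SAFE_ELEMENTS.get(name.toUpperCase(Locale.ENGLISH));
    }

    @Override
    public boolean isDiscardElement(String name) {
        return DISCARDABLE_ELEMENTS.contains(name.toUpperCase(Locale.ENGLISH));
    }

    @Override
    public String mapSafeAttribute(String elementName, String attributeName) {
        // This sketch keeps no attributes at all.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(INSTANCE.isDiscardElement("nav"));
        System.out.println(INSTANCE.mapSafeElement("h1"));
    }
}
```

Lookups upper-case the element name first, so the mapper treats `nav` and `NAV` the same way.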
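This has had me stumped for a while, so I would really appreciate any hints!
Edit: I found that I can make this work by adding the CustomHtmlMapper class to the org.apache.tika.parser.html package, building the tika-parsers jar with my class included, replacing the tika-parsers jar in Nutch with it, and then changing the tika.htmlmapper.classname setting in nutch-site.xml to org.apache.tika.parser.html.CustomHtmlMapper.
However, it would be far more convenient if I could keep CustomHtmlMapper in my own package and have Nutch/Tika reference it. The JVM seems to fail when it encounters the "implements org.apache.tika.parser.html.HtmlMapper" reference in the CustomHtmlMapper class (see the stack trace above). I would welcome any suggestions on how to achieve this.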
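To narrow this down, a small probe like the one below could show which class loader can actually see a given class. It is only a sketch using standard JDK calls: `ClassLoaderProbe` and its method names are hypothetical, and inside Nutch one would pass the plugin's class loader rather than the system one. If I read the stack trace correctly, my class is being defined by the application class loader, which presumably cannot see the Tika classes that live in the plugin's own loader.

```java
import java.util.ArrayList;
import java.util.List;

class ClassLoaderProbe {
    // Lists the class names of the given loader and each of its parents,
    // ending with a "bootstrap" marker for the JVM's built-in root loader.
    static List<String> loaderChain(ClassLoader loader) {
        List<String> chain = new ArrayList<>();
        for (ClassLoader cl = loader; cl != null; cl = cl.getParent()) {
            chain.add(cl.getClass().getName());
        }
        chain.add("bootstrap");
        return chain;
    }

    // Reports whether the named class can be resolved through the given
    // loader, without running the class's static initialisers.
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader app = ClassLoader.getSystemClassLoader();
        System.out.println("Loader chain: " + loaderChain(app));
        System.out.println("HtmlMapper visible to the application loader: "
                + isVisible("org.apache.tika.parser.html.HtmlMapper", app));
    }
}
```

If the probe reports the interface as invisible to the loader that defines the custom class, that would match the NoClassDefFoundError in the trace.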