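I am trying to create a plugin for Nutch. I am using Nutch 1.7 with Solr, and I have worked through many different tutorials. I want to implement a plugin that returns the raw HTML data. I used the standard Nutch wiki and the following tutorial: http://sujitpal.blogspot.nl/2009/07/nutch-custom-plugin-to-parse-and-add.html
I created two files, getDivinfohtml.java and getDivinfo.java.
getDivinfohtml.java needs to read the content and then return the complete source code, or at least part of it.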
package org.apache.nutch.indexer;

public class getDivInfohtml implements HtmlParseFilter
{
    private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
    private Configuration conf;
    public static final String TAG_KEY = "source";
    // Logger logger = Logger.getLogger("mylog");
    // FileHandler fh;
    //FileSystem fs = FileSystem.get(conf);
    //Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    //SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    //Text key = new Text();
    // Content content = new Content();
    // fh = new FileHandler("/root/JulienKulkerNutch/mylogfile.log");
    // logger.addHandler(fh);
    // SimpleFormatter formatter = new SimpleFormatter();
    //fh.setFormatter(formatter);

    public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
    {
        try
        {
            LOG.info("Parsing Url:" + content.getUrl());
            LOG.info("Julien: "+content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));
            Parse parse = parseResult.get(content.getUrl());
            Metadata metadata = parse.getData().getParseMeta();
            String fullContent = metadata.get("fullcontent");
            Document document = Jsoup.parse(fullContent);
            Element contentwrapper = document.select("div#jobBodyContent").first();
            String source = contentwrapper.text();
            metadata.add("SOURCE", source);
            return parseResult;
        }
        catch(Exception e)
        {
            LOG.info(e);
        }
        return parseResult;
    }

    public Configuration getConf()
    {
        return conf;
    }

    public void setConf(Configuration conf)
    {
        this.conf = conf;
    }
}
It reads the complete content right away and then extracts the text inside jobBodyContent.
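One detail of the "return the raw HTML" step is worth illustrating: the log line in the filter calls substring(indexOf("<!DOCTYPE html")), and indexOf returns -1 when the marker is absent, so substring throws a StringIndexOutOfBoundsException for such pages. The following is only a minimal sketch using the JDK alone. The class RawHtmlSketch and its methods are hypothetical names, and it assumes the page bytes would come from content.getContent() in Nutch 1.x and that the charset is known, neither of which the post confirms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it is not part of Nutch.
final class RawHtmlSketch {
    private static final String MARKER = "<!DOCTYPE html";

    // Case-insensitive indexOf that keeps indices aligned with the original text.
    static int indexOfIgnoreCase(String text, String needle) {
        for (int i = 0; i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Decodes the raw bytes and returns the text from the doctype onward.
    // Falls back to the whole decoded text when no doctype is present,
    // and to an empty string when there are no bytes at all.
    static String fromDoctype(byte[] rawBytes, Charset charset) {
        if (rawBytes == null) {
            return "";
        }
        String html = new String(rawBytes, charset);
        int start = indexOfIgnoreCase(html, MARKER);
        return start >= 0 ? html.substring(start) : html;
    }

    public static void main(String[] args) {
        // Assumption: in a real filter these bytes would be content.getContent().
        byte[] page = "leading junk<!doctype html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(fromDoctype(page, StandardCharsets.UTF_8));
    }
}
```

Working from the raw bytes rather than from content.toString() avoids mixing the page source with the other fields that toString() prints, and the -1 check keeps pages without a doctype from failing.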
Then we have the parser, which needs to put the data into a field.
getDivinfo (parser)
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
{
    // LOG.info("Julien is sukkel");
    try
    {
        fh = new FileHandler("/root/JulienKulkerNutch/mylogfile2.log");
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
        logger.info("Julien is sukkel");
        Metadata metadata = parse.getData().getParseMeta();
        logger.info("julien is gek:");
        String fullContent = metadata.get("SOURCE");
        logger.info("Output:" + metadata);
        logger.info(fullContent);
        String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
        logger.info(fullSource);
        doc.add("divcontent", fullContent);
    }
    catch(Exception e)
    {
        //LOG.info(e);
    }
    return doc;
}
The error is in getDivinfo, on this line: String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
[javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol
[javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
Answer 0 (score: 0)
You may need to implement HTMLParser. In your getFields implementation:
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();

static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
}

public Collection<Field> getFields() {
    return FIELDS;
}
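Separately from getFields, two details of the code posted in the question look relevant to the failing line. In Nutch 1.x, Metadata.getValues(name) returns a String[], while Metadata.get(name) returns a single String, so assigning the result of getValues to a String will not compile. In addition, the parse filter stores the text under "SOURCE" whereas TAG_KEY is "source", and as far as I know the lookup is case-sensitive. The quoted "cannot find symbol" message may instead point at getDivInfohtml not being visible from getDivInfo (for example a missing import, or the class name not matching its file name); the post does not show enough to tell. The sketch below mimics the two accessors with a hypothetical stand-in class. SimpleMetadata is not the Nutch class and only illustrates the idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; SimpleMetadata is a stand-in, NOT org.apache.nutch.metadata.Metadata.
final class MetadataKeySketch {
    // Same key constant as in the question's parse filter.
    static final String TAG_KEY = "source";

    // Minimal multi-valued, case-sensitive metadata map.
    static final class SimpleMetadata {
        private final Map<String, List<String>> values = new HashMap<>();

        void add(String name, String value) {
            values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        // Returns the first value stored under the name, or null if there is none.
        String get(String name) {
            List<String> stored = values.get(name);
            return (stored == null || stored.isEmpty()) ? null : stored.get(0);
        }

        // Returns every value stored under the name as an array (empty if none).
        String[] getValues(String name) {
            List<String> stored = values.get(name);
            return stored == null ? new String[0] : stored.toArray(new String[0]);
        }
    }

    public static void main(String[] args) {
        SimpleMetadata meta = new SimpleMetadata();
        // Write and read with the same constant so the keys cannot drift apart.
        meta.add(TAG_KEY, "text of div#jobBodyContent");

        String single = meta.get(TAG_KEY);      // one value: a String
        String[] all = meta.getValues(TAG_KEY); // all values: a String[]

        System.out.println(single);
        System.out.println(all.length);
        // A differently cased key finds nothing in this stand-in.
        System.out.println(meta.get("SOURCE"));
    }
}
```

Using the TAG_KEY constant on both the writing and the reading side, and choosing get or getValues according to the type of the receiving variable, would remove both mismatches in a sketch like this one.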