How to create a Nutch plugin that returns the raw HTML to the parser

Asked: 2014-03-25 15:33:30

Tags: java solr nutch

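I am trying to create a plugin for Nutch. I am using Nutch 1.7 and Solr, and I have worked through several different tutorials. I want to implement a plugin that returns the raw HTML data. I used the standard Nutch wiki and the following tutorial: http://sujitpal.blogspot.nl/2009/07/nutch-custom-plugin-to-parse-and-add.html

I created two files, getDivinfohtml.java and getDivinfo.java.

getDivinfohtml.java needs to read the content and then return the complete source code, or at least part of it.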

 package org.apache.nutch.indexer;
 public class getDivInfohtml implements HtmlParseFilter
 {
private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
private Configuration conf;
    public static final String TAG_KEY = "source";
    // Logger logger = Logger.getLogger("mylog");
    // FileHandler fh;
    //FileSystem fs = FileSystem.get(conf);
    //Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    //SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    //Text key = new Text();
    // Content content = new Content();
    // fh = new FileHandler("/root/JulienKulkerNutch/mylogfile.log");
// logger.addHandler(fh);
// SimpleFormatter formatter = new SimpleFormatter();
//fh.setFormatter(formatter);


public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
{
    try
    {
        LOG.info("Parsing Url:" + content.getUrl());
        LOG.info("Julien: "+content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));

        Parse parse = parseResult.get(content.getUrl());
        Metadata metadata = parse.getData().getParseMeta();
        String fullContent = metadata.get("fullcontent");

        Document document = Jsoup.parse(fullContent);
        Element contentwrapper = document.select("div#jobBodyContent").first();
        String source = contentwrapper.text();
        metadata.add("SOURCE", source);

        return parseResult;

    }
    catch(Exception e)
    {
        LOG.info(e);
    }

    return parseResult;
}


public Configuration getConf()
{
    return conf;
}

public void setConf(Configuration conf)
{
    this.conf = conf;
}

}

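At the moment it reads the complete content and then extracts the text inside jobBodyContent.

Then we have the parser, which needs to put the data into a field.

getDivinfo (parser)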

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
{
    // LOG.info("Julien is sukkel");
    try
    {
        fh = new FileHandler("/root/JulienKulkerNutch/mylogfile2.log");
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
        logger.info("Julien is sukkel");
        Metadata metadata = parse.getData().getParseMeta();
        logger.info("julien is gek:");
        String fullContent = metadata.get("SOURCE");
        logger.info("Output:" + metadata);
        logger.info(fullContent);
        String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
        logger.info(fullSource);
        doc.add("divcontent", fullContent);

    }
    catch(Exception e)
    {
        //LOG.info(e);
    }

    return doc;
}

The error is in getDivinfo, on the line String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);

    [javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol
    [javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
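
Whichever symbol javac fails to resolve here, two other details of that line may be worth checking. In Nutch 1.x, Metadata.get(name) returns a single String while Metadata.getValues(name) returns a String[], so assigning the result of getValues to a String would not compile either. Also, the parse filter stores the value under "SOURCE" while TAG_KEY is "source". The sketch below illustrates both points with a hypothetical stand-in class (MultiValueMeta is not the Nutch class, only an assumption-laden imitation of its get/getValues behaviour):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued, case-sensitive metadata store. */
class MultiValueMeta {
    private final Map<String, List<String>> values = new HashMap<>();

    /** Appends a value under the given name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value stored under the name, or null if none. */
    String get(String name) {
        List<String> v = values.get(name);
        return v == null ? null : v.get(0);
    }

    /** Returns all values stored under the name (empty array if none). */
    String[] getValues(String name) {
        List<String> v = values.get(name);
        return v == null ? new String[0] : v.toArray(new String[0]);
    }
}

class MetaKeyDemo {
    // One shared constant for writer and reader avoids "SOURCE" vs "source".
    static final String TAG_KEY = "source";

    public static void main(String[] args) {
        MultiValueMeta meta = new MultiValueMeta();
        meta.add(TAG_KEY, "<div id=\"jobBodyContent\">text</div>");

        String first = meta.get(TAG_KEY);        // single value: a String
        String[] all = meta.getValues(TAG_KEY);  // all values: a String[]
        String wrongCase = meta.get("SOURCE");   // different key, nothing stored

        System.out.println(first);
        System.out.println(all.length);
        System.out.println(wrongCase);
    }
}
```

If the real metadata store is case-sensitive, as this stand-in is, writing with getDivInfohtml.TAG_KEY in the parse filter and reading with the same constant in the indexing filter keeps both sides in step.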

1 Answer:

Answer 0: (score: 0)

You may need to implement HTMLParser. In your getFields implementation:

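    // Fields the plugin needs loaded; WebPage.Field is assumed to be
    // org.apache.nutch.storage.WebPage.Field.
    private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();

    static {
        FIELDS.add(WebPage.Field.CONTENT);
        FIELDS.add(WebPage.Field.OUTLINKS);
    }

    public Collection<WebPage.Field> getFields() {
        return FIELDS;
    }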
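
As for getting at the raw HTML itself, the parse filter in the question logs content.toString() cut at "<!DOCTYPE html", which throws if that marker is absent or written in a different case. A minimal sketch of a safer variant follows, using only the Java standard library. The class name RawHtml and both helper methods are hypothetical, and UTF-8 is an assumption. In the plugin the bytes would presumably come from content.getContent(), which returns a byte[] in Nutch 1.x.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: turns fetched bytes into an HTML string. */
class RawHtml {
    // Matches the doctype marker regardless of letter case.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!doctype html", Pattern.CASE_INSENSITIVE);

    /** Decodes the raw bytes with the given charset (UTF-8 is an assumption). */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /**
     * Returns the text starting at the doctype marker, or the whole text
     * unchanged when no marker is found, instead of throwing.
     */
    static String fromDoctype(String text) {
        Matcher m = DOCTYPE.matcher(text);
        return m.find() ? text.substring(m.start()) : text;
    }

    public static void main(String[] args) {
        byte[] raw = "url: http://example.com/\n<!DOCTYPE html><html><body>hi</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(fromDoctype(decode(raw, StandardCharsets.UTF_8)));
    }
}
```

The resulting string could then be added to the parse metadata under a single shared key, so that the indexing filter can read it back into a field.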