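How to crawl .pdf links with Apache Nutch

Posted: 2013-07-03 07:25:41

Tags: apache hadoop nutch

I have a website to crawl that contains some links to PDF files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, and I invoke it from Java like this: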

ToolRunner.run(NutchConfiguration.create(), new Crawl(),
                                 tokenize(crawlArg));
 SegmentReader.main(tokenize(dumpArg));

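Can anyone help me?

2 Answers:

Answer 0 (score: 3)

If you want Nutch to crawl and index your PDF documents, you have to enable document crawling and the Tika plugin:

  1. Document crawling

    1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf" from the suffix-skip rule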

    # skip image and other suffixes we can't yet parse
    # for a more extensive coverage use the urlfilter-suffix plugin
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
    

    1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"

    1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes property

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin directory names to
      include.  Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable 
      protocol-httpclient, but be aware of possible intermittent problems with the 
      underlying commons-httpclient library.
      </description>
    </property>
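
Before starting a crawl, the two settings from steps 1.1 and 1.3 can be sanity-checked offline. The following is a minimal sketch in plain `java.util.regex`, not Nutch code: the class and method names are hypothetical, and it assumes that a `-` rule rejects a URL when its pattern is found anywhere in the URL and that a plugin is enabled when its whole id matches `plugin.includes`.

```java
import java.util.regex.Pattern;

// Sketch only: re-creates the two regular expressions shown above with
// java.util.regex. It does not load or call any Nutch classes.
class PdfCrawlConfigCheck {

    // Suffix rule copied from regex-urlfilter.txt above, without the leading '-'.
    static final Pattern SKIPPED_SUFFIXES = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT"
        + "|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$");

    // plugin.includes value copied from nutch-site.xml above.
    static final Pattern PLUGIN_INCLUDES = Pattern.compile(
        "protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)"
        + "|scoring-opic|urlnormalizer-(pass|regex|basic)");

    // Assumption: a '-' rule rejects a URL when its pattern is found in the URL.
    static boolean isSkippedBySuffix(String url) {
        return SKIPPED_SUFFIXES.matcher(url).find();
    }

    // Assumption: a plugin is enabled when its whole id matches the expression.
    static boolean isPluginIncluded(String pluginId) {
        return PLUGIN_INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // The example.com URLs are placeholders, not part of the original question.
        System.out.println(isSkippedBySuffix("http://example.com/docs/report.pdf")); // false
        System.out.println(isSkippedBySuffix("http://example.com/img/logo.png"));    // true
        System.out.println(isPluginIncluded("parse-tika"));                          // true
        System.out.println(isPluginIncluded("parse-pdf"));                           // false
    }
}
```

If the first check prints `true` for a .pdf URL, the suffix rule still contains "pdf" and the links will be filtered out before they are fetched.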
    
  2. If what you really want is just to download all the PDF files from a page, you can use a tool such as Teleport on Windows or Wget on *nix.
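
Since the question drives everything from Java, the same "just download the PDFs" idea can also be sketched without Nutch or an external tool. This is a rough illustration, not how Wget or Teleport work internally: `PdfLinkExtractor`, `findPdfLinks`, and `download` are hypothetical names, the example.com URLs are placeholders, and the regex-based link extraction only handles simple quoted `href` values without spaces, queries, or fragments.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: finds links ending in .pdf in an HTML string and can save them.
class PdfLinkExtractor {

    // Matches href="..." or href='...' values that end in .pdf (case-insensitive).
    private static final Pattern PDF_HREF = Pattern.compile(
        "href\\s*=\\s*[\"']([^\"'#?\\s]+\\.pdf)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns absolute URIs for every PDF link, resolved against the page URI.
    static List<URI> findPdfLinks(URI pageUri, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = PDF_HREF.matcher(html);
        while (m.find()) {
            links.add(pageUri.resolve(m.group(1)));
        }
        return links;
    }

    // Saves one link to a local file; needs network access, so main() does not call it.
    static void download(URI link, Path target) throws IOException {
        try (InputStream in = link.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"/files/a.pdf\">A</a> <a href='b.PDF'>B</a> <a href=\"c.html\">C</a>";
        URI page = URI.create("http://example.com/docs/index.html");
        for (URI link : findPdfLinks(page, html)) {
            System.out.println(link);
        }
        // prints:
        // http://example.com/files/a.pdf
        // http://example.com/docs/b.PDF
    }
}
```

For anything beyond a single page (recursion, politeness delays, robots.txt), a dedicated tool or Nutch itself is the more sensible choice.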

Answer 1 (score: -1)

You can write your own plugin for the PDF MIME type,
or use the embedded Apache Tika parser, which can extract the text from a PDF.