Question

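I'm currently using crawler4j as my web crawler of choice, and I'm trying to teach myself how web crawlers work. I've started a crawl, and I'd like a quick way to get at the crawl data stored under crawlStorageFolder (/data/crawl/root):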

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class Controller {

    public static void main(String[] args) throws Exception {

        /*
         * crawlStorageFolder is a folder where intermediate crawl data is
         * stored.
         */
        String crawlStorageFolder = "/data/crawl/root";

        /*
         * numberOfCrawlers shows the number of concurrent threads that should
         * be initiated for crawling.
         */
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        // (the rest of the controller setup is not shown in the question)
    }
}

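The problem is that the only things I can find at the crawlStorageFolder location are two .lck files and one .jdb file. I assume the .jdb file is where the data is stored, but I can't open any of them. Could someone kindly help me understand how to access the data, so that I can import it into a database and eventually display it on my website? Many thanks.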

Answer 1

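Crawler4j uses BerkeleyDB to store crawl information. See here in the source.

From the command line, you can use the DB utils to access the data. This is already covered on SO here.

If you want to access the data from Java code, just import the BerkeleyDB library (Maven instructions there) and follow the tutorial on how to open the DB.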
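
One practical detail before opening the DB: a Berkeley DB Java Edition environment is opened on a directory, not on an individual .jdb file, so you first need to know which directory actually holds the .jdb files. The sketch below uses only the Java standard library to find that directory under the storage folder. The class name JdbLocator is hypothetical, and the idea that the files may sit in a subfolder (some crawler4j versions appear to use one named frontier) is an assumption to check against your own installation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: finds directories that contain Berkeley DB .jdb files. */
public class JdbLocator {

    /** Returns every directory under root (root included) that directly contains a .jdb file. */
    public static List<Path> findEnvironmentDirs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".jdb"))
                    .map(Path::getParent)   // the directory is what the DB library needs
                    .distinct()
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the storage folder from the question; pass another path as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "/data/crawl/root");
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory: " + root);
            return;
        }
        for (Path dir : findEnvironmentDirs(root)) {
            System.out.println(dir);
        }
    }
}
```

Each path printed is a candidate environment home to hand to the BerkeleyDB library. Open it read-only, and only while the crawler is not running, since the .lck files are the lock files of that environment.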

Answer 2

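You should not use the data in that folder. Treat it as the crawler's internal data. You can always dump/write the crawled data yourself from the visit method of your WebCrawler.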
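
As a minimal sketch of that approach, the class below appends one CSV row per page and is meant to be called from visit. PageSink, its column layout, and the file name are all hypothetical choices for illustration; only the standard library is used, so it compiles without crawler4j. The comment at the bottom shows how it might be wired into a WebCrawler subclass.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: appends one CSV row (url, title, text length) per crawled page. */
public class PageSink implements AutoCloseable {

    private final BufferedWriter out;

    public PageSink(Path file) throws IOException {
        // Create the file if needed and append to it on later runs.
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Synchronized because the crawl runs several crawler threads that may share one sink. */
    public synchronized void write(String url, String title, int textLength) {
        try {
            out.write(quote(url) + "," + quote(title) + "," + textLength);
            out.newLine();
            out.flush();
        } catch (IOException e) {
            // visit() cannot throw checked exceptions, so rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps a value in double quotes and doubles embedded quotes; null becomes an empty field. */
    static String quote(String s) {
        String v = (s == null) ? "" : s;
        return "\"" + v.replace("\"", "\"\"") + "\"";
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}

// Possible use inside a WebCrawler subclass (crawler4j classes, not compiled here;
// "sink" is an assumed shared PageSink instance):
//
//   @Override
//   public void visit(Page page) {
//       if (page.getParseData() instanceof HtmlParseData) {
//           HtmlParseData html = (HtmlParseData) page.getParseData();
//           sink.write(page.getWebURL().getURL(), html.getTitle(), html.getText().length());
//       }
//   }
```

The resulting CSV file can then be bulk-loaded into whatever database backs the website. Writing straight to the database from visit would work the same way, with the write method replaced by an insert.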

Accessing the .lck and .jdb files stored by a web crawler
