Crawler4j: crawling content loaded live by jQuery

Date: 2014-05-17 16:46:37

Tags: java crawler4j

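I have a website whose category pages build the product list with JavaScript after the page has loaded. When my crawler visits those pages, it cannot find any products. How can I solve this?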

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(rootFolder);
        config.setMaxPagesToFetch(100000000);
        // -1 means no limit on crawl depth
        config.setMaxDepthOfCrawling(-1);
        // delay between requests, in milliseconds
        config.setPolitenessDelay(1);
        config.setUserAgentString("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36");
        //config.setResumableCrawling(true);
        config.setIncludeHttpsPages(true);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);



        controller.addSeed(siteDomain);
        // optional extra seeds come from args[4] through args[14]
        for (int i = 4; i <= 14; i++) {
            if (i < args.length) {
                controller.addSeed(args[i]);
            }
        }



        controller.start(Crawling.class, numberOfCrawlers);


        List<Object> crawlersLocalData = controller.getCrawlersLocalData();

1 Answer:

Answer 0 (score: 0)

Unfortunately, crawler4j only supports static content. For JavaScript and Ajax support, use a crawler such as Crawljax, or Nutch together with Selenium.
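
To illustrate why a static crawler comes back empty-handed, here is a minimal sketch that compares the HTML as the server sends it with the DOM after the script has run. Everything in it is an assumption made for illustration: the markup, the `product` class name, the `/ajax/products` URL, and the `StaticHtmlCheck` helper are hypothetical and not taken from the site in the question. The regex is a rough check, not a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: the markup and the "product" class name are hypothetical.
final class StaticHtmlCheck {

    // Matches anchors carrying class="product"; a rough regex, not a full HTML parser.
    private static final Pattern PRODUCT_LINK =
            Pattern.compile("<a\\s+[^>]*class=\"product\"[^>]*>", Pattern.CASE_INSENSITIVE);

    // Counts the product links present in the given HTML text.
    static int countProductLinks(String html) {
        Matcher m = PRODUCT_LINK.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // What the server sends: an empty container plus a script that fills it later.
        String served = "<div id=\"products\"></div>"
                + "<script>$('#products').load('/ajax/products');</script>";
        // What a browser holds once the script has run.
        String rendered = "<div id=\"products\">"
                + "<a class=\"product\" href=\"/p/1\">A</a>"
                + "<a class=\"product\" href=\"/p/2\">B</a>"
                + "</div>";

        System.out.println("served:   " + countProductLinks(served));
        System.out.println("rendered: " + countProductLinks(rendered));
    }
}
```

A crawler that does not execute JavaScript only ever sees the first string, so the product links have to come from a tool that runs the page's scripts in a browser engine.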