I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all — iterate over every port, extract the name and coordinates, and write them to a file. The main class looks like this:
import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "data";
        int numberOfCrawlers = 5;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(2);
        config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
        //config.setPolitenessDelay(20);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched; the crawler then starts following links
         * found in these pages.
         */
        controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(PortExtractor.class, numberOfCrawlers);

        System.out.println("finished reading");
        System.out.println("Ports: " + PortExtractor.portList.size());

        FileWriter writer = new FileWriter("PortInfo2.txt");
        System.out.println("Writing to file...");
        for (Port p : PortExtractor.portList) {
            writer.append(p.print() + "\n");
            writer.flush();
        }
        writer.close();
        System.out.println("File written");
    }
}
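(The Port class referenced above is not shown in the question. Purely as a sketch of what it might look like — the field names and the print format are my assumptions, chosen to be consistent with the p.print() call and the name/coordinates goal:)

```java
public class Port {
    private final String name;
    private final double latitude;
    private final double longitude;

    public Port(String name, double latitude, double longitude) {
        this.name = name;
        this.latitude = latitude;
        this.longitude = longitude;
    }

    // Formats the port as a single tab-separated line for the output file.
    public String print() {
        return name + "\t" + latitude + "\t" + longitude;
    }
}
```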
The PortExtractor looks like this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
    );

    public static List<Port> portList = new ArrayList<Port>();

    /**
     * Crawling logic
     */
    //@Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches()
               && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed.
     */
    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
    }
}
How do I write the HTML parser, and how do I tell the program not to crawl anything except the port-information links? I'm stuck: even though the code runs, it breaks whenever I try to add HTML parsing. Any help would be appreciated.
Answer 0 (score: 2)
The first task is to check the site's robots.txt to verify whether crawler4j is actually allowed to crawl this website. Inspecting that file shows there is no problem:
User-agent: *
Allow: /
Disallow: /mob/
Disallow: /upload/
Disallow: /users/
Disallow: /wiki/
Second, we need to figure out which links are of particular interest for your purpose. This requires some manual investigation. I only checked a few entries of the link mentioned above, but I found that every port contains the keyword ports in its link, e.g.
http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50
http://www.marinetraffic.com/en/ais/details/ports/18853/China_port:YANGZHOU
http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN
With this information, we can modify the shouldVisit method in a whitelist manner:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
           && href.contains("www.marinetraffic.com")
           && href.contains("ports");
}
This is a very simple implementation, which could be enhanced with regular expressions.
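One way to do that enhancement: anchor the whitelist to the URL paths seen in the example links above, so that only the ports index and port detail pages pass. This is a sketch under the assumption that those two path shapes cover all port pages; the class and method names here are my own, chosen for illustration:

```java
import java.util.regex.Pattern;

public class PortUrlFilter {
    // Whitelist: only the ports index and the port detail pages on
    // marinetraffic.com. The path fragments are taken from the example
    // links above; adjust the pattern if the site's URL scheme differs.
    private static final Pattern PORT_PAGES = Pattern.compile(
            "^http://www\\.marinetraffic\\.com/en/ais/(index|details)/ports/.*$");

    public static boolean isPortPage(String url) {
        // shouldVisit lowercases the URL first, so do the same here.
        return PORT_PAGES.matcher(url.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPortPage(
                "http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN"));
        System.out.println(isPortPage(
                "http://www.marinetraffic.com/en/ais/home/zoom:14/centerx:103.75445"));
    }
}
```

Inside shouldVisit you would then return `!FILTERS.matcher(href).matches() && PortUrlFilter.isPortPage(href)`.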
Third, we need to parse the data from the HTML. The information you are looking for is contained in the following <div> section:
<div class="bg-info bg-light padding-10 radius-4 text-left">
<div>
<span>Latitude / Longitude: </span>
<b>1.2593655° / 103.75445°</b>
<a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655" title="Show on Map"><img class="loaded" src="/img/icons/show_on_map_magnify.png" data-original="/img/icons/show_on_map_magnify.png" alt="Show on Map" title="Show on Map"></a>
<a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655/showports:1" title="Show on Map">Show on Map</a>
</div>
<div>
<span>Local Time:</span>
<b><time>2016-12-11 19:20</time> [UTC +8]</b>
</div>
<div>
<span>Un/locode: </span>
<b>SGSIN</b>
</div>
<div>
<span>Vessels in Port: </span>
<b><a href="/en/ais/index/ships/range/port_id:290/port_name:SINGAPORE">1021</a></b>
</div>
<div>
<span>Expected Arrivals: </span>
<b><a href="/en/ais/index/eta/all/port:290/portname:SINGAPORE">1059</a></b>
</div>
</div>
Basically, I would use an HTML parser (for example Jericho) for this task. With it you can extract exactly the right <div> section and retrieve the attributes you are looking for.
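To see the extraction step end to end without pulling in a parser dependency, here is a sketch that pulls the latitude/longitude pair out of the snippet above with a plain regular expression. Note the pattern is tied to this exact markup (the "Latitude / Longitude" span followed by a <b> element), so a real HTML parser such as Jericho remains the more robust choice; the class name is my own:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoordinateExtractor {
    // Matches the "Latitude / Longitude" row of the port detail page, e.g.
    //   <span>Latitude / Longitude: </span>
    //   <b>1.2593655° / 103.75445°</b>
    private static final Pattern COORDS = Pattern.compile(
            "Latitude / Longitude: </span>\\s*<b>([-0-9.]+)°\\s*/\\s*([-0-9.]+)°</b>");

    // Returns {latitude, longitude}, or null if the pattern is not found.
    public static double[] extract(String html) {
        Matcher m = COORDS.matcher(html);
        if (m.find()) {
            return new double[] { Double.parseDouble(m.group(1)),
                                  Double.parseDouble(m.group(2)) };
        }
        return null;
    }

    public static void main(String[] args) {
        String snippet = "<div><span>Latitude / Longitude: </span>"
                + "<b>1.2593655° / 103.75445°</b></div>";
        double[] ll = extract(snippet);
        System.out.println(ll[0] + " / " + ll[1]);
    }
}
```

The port name can be captured the same way from the page title, and each (name, latitude, longitude) triple then becomes one entry in PortExtractor.portList.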