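I'm having a lot of trouble scraping some data from a page (one example is http://www.arena-offshore.com/crew-boats.html). We have permission to take the data, but they are "too busy" to hand it over to us in any form.
I have tried the Web Scraper extension for Chrome and import.io, and have started looking into more sophisticated programs, but they are a bit beyond me. To begin with, none of the programs seems able to recognize the separate link for each boat, so I can't even get to the point of scraping the individual fields. So I figure that if someone knows how to grab the separate link for each boat, I can work out the rest. Does anyone have any ideas? I know my skills aren't the best, but I'm hoping someone can point me in the right direction.
Many thanks,
Alex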
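Answer 0 (score: 0)
How about using Java with the HTML parser jsoup? jsoup is a good tool for parsing websites (as long as they do not depend on JavaScript), and it lets you use CSS selectors to pick out specific HTML elements.
Java code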
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Scraper {

    private final List<Category> categories = new ArrayList<>();

    static class Category {
        private final int categoryNumber;
        private final String title;

        public Category(int categoryNumber, String title) {
            this.categoryNumber = categoryNumber;
            this.title = title;
        }

        public int getCategoryNumber() {
            return categoryNumber;
        }

        public String getTitle() {
            return title;
        }
    }

    public Scraper() {
        // The number is passed as the "category" query parameter of the list URL
        categories.add(new Category(1, "CREW BOATS"));
        categories.add(new Category(2, "TUG BOATS"));
        categories.add(new Category(3, "AHT & AHTS"));
        categories.add(new Category(4, "SUPPLY/UTILITY VESSELS"));
        categories.add(new Category(5, "BARGES"));
        categories.add(new Category(6, "MISCELLANEOUS"));
    }

    private void scrapeCategory(Category category) {
        System.out.println("\n" + category.getTitle() + "\n");
        String searchUrl = "http://www.arena-offshore.com/iframe/list/index.php?category="
                + category.getCategoryNumber() + "&page=";
        int pageIndex = 1;
        // Request page 1, 2, 3, ... until a page contains no boat entries
        while (true) {
            try {
                Document doc = Jsoup.connect(searchUrl + pageIndex)
                        .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36")
                        .referrer("http://www.arena-offshore.com/")
                        .get();
                Elements boats = doc.select("#pu132"); // each match is treated as one boat entry
                if (boats.isEmpty()) { // no more results
                    break;
                }
                for (Element element : boats) {
                    String boat = element.select("[data-muse-uid=\"U159\"]").first().text(); // ID
                    boat += "\n\t\t" + element.select("a").first().attr("href"); // HREF
                    boat += "\n\t\t" + element.select("[data-muse-uid=\"U158\"]").get(0).text(); // TYPE
                    boat += "\n\t\t" + element.select("[data-muse-uid=\"U158\"]").get(1).text(); // LOCATION
                    boat += "\n\t\t" + element.select("[data-muse-uid=\"U153\"]").first().text(); // BRIEF DETAILS
                    System.out.println("\t" + boat);
                }
                pageIndex++;
            } catch (IOException e) {
                // Give up on this category; retrying the same page forever would never terminate
                e.printStackTrace();
                break;
            }
        }
    }

    public void scrapeAllCategories() {
        for (Category category : categories) {
            scrapeCategory(category);
        }
    }

    public static void main(String[] args) {
        new Scraper().scrapeAllCategories();
    }
}
Note: you need to download the jsoup core library and add it to your build path.
Output
CREW BOATS
AR-C1002
http://www.arena-offshore.com/agent-boat-for-sale-AR-C1002.html
AGENT BOAT
EAST MED.
FOR SALE 14 X 4 X 1.9 (DEPTH)M, 2007 BUILT 630 BHP, 12 PERSONEL, IN EAST MED.
...
AR-C1000
http://www.arena-offshore.com/AR-KED.html?page=13
CREW BOAT
SOUTH AMERICA
FOR SALE 17 X 5 X 2.18M, 2009 BUILT 1200 BHP, IN SOUTH AMERICA
TUG BOATS
AR-KTK
http://www.arena-offshore.com/single-screw-tug-boat-AR-KTK.html
SINGLE SCREW
TURKEY
FOR SALE 1998 BUILT/ 2008 REBUILT 1000 HP / 16 TBP
...
AHT & AHTS
AR-RZA
http://www.arena-offshore.com/AR-RZA.html
ANCHOR HANDLING TUG / TOWING
AFRICA
FOR SALE 36 X 10 X 4 M (MAX DRAFT) 4400 BHP / 58 TBP
...
SUPPLY/UTILITY VESSELS
AR-U5001
http://www.arena-offshore.com/survey-vessel-in-south-east-asia-AR-U5001.html
SURVEY SUPPOT VESSEL
SOUTH EAST ASIA
FOR SALE 20 X 6 X 1.5 (DRAFT)M, 2012 BUILT, IRS CLASS 650 BHP, 50 M2 DECK SPACE
...
BARGES
AR-KLM
http://www.arena-offshore.com/AR-KLM.html
ACCOMMODATION
SHETLANDS
FOR CHARTER 1993 BUILT MAJOR CONVERSION 2004 AND 2013
...
MISCELLANEOUS
AR-SAA
http://www.arena-offshore.com/AR-SAA.html
SHALLOW DRAFT MPP / WORKBOAT
-
FOR CHARTER 2800 HP / 37.3 TBP CERTIFIED
...
Note: the output has been shortened; wherever ... appears, more results are printed.
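If the links and fields are meant to be handed to another tool, it may be more convenient to write them as CSV than to print them with tabs. The following is only a minimal sketch under that assumption: `BoatCsv`, `escape`, and `toCsvLine` are hypothetical names that are not part of the Scraper class above, and the sample values are copied from the output shown earlier. Fields containing commas or quotes are quoted in the usual CSV style so that the "brief details" text does not spill into extra columns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (an assumption, not part of the answer's Scraper class).
final class BoatCsv {

    // Quote a field if it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Join already-scraped field values into a single CSV line.
    static String toCsvLine(String... fields) {
        List<String> escaped = new ArrayList<>();
        for (String field : fields) {
            escaped.add(escape(field));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Header row, then one boat taken from the sample output
        System.out.println(toCsvLine("id", "link", "type", "location", "details"));
        System.out.println(toCsvLine(
                "AR-C1002",
                "http://www.arena-offshore.com/agent-boat-for-sale-AR-C1002.html",
                "AGENT BOAT",
                "EAST MED.",
                "FOR SALE 14 X 4 X 1.9 (DEPTH)M, 2007 BUILT 630 BHP, 12 PERSONEL, IN EAST MED."));
    }
}
```

In the scraper, the five values extracted inside the for loop could be passed to `toCsvLine` instead of being concatenated into the `boat` string.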