Scraping a web page from an .asp URL

Time: 2018-02-02 02:01:28

Tags: java asp.net url web-scraping

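I'm trying to extract data from a site that lists routes between different airports. The user picks two airports, and the program then shows them all the different routes on a given day. The problem is that once you search for a route on the site, the URL changes to the same .asp address no matter which route you are viewing. Is there a way to scrape the data for a specific route without knowing its URL, or is it possible to get the real URL?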

2 answers:

Answer 0 (score: 10)

I'd suggest using JSoup. To do so, add the following to your pom.xml:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.2</version>
</dependency>

Then make a first request to obtain the cookies:

    Connection.Response initialPage = Jsoup.connect("https://www.flightview.com/flighttracker/")
            .headers(headers)
            .method(Connection.Method.GET)
            .userAgent(userAgent)
            .execute();
    Map<String, String> initialCookies = initialPage.cookies();
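
For readers unfamiliar with what the cookie map is for: cookies returned by the first response are sent back to the server in a `Cookie` request header on the next call. The sketch below, using only the JDK, shows how such a map turns into that header value. It illustrates the HTTP convention rather than JSoup's internal code, and the class name and the cookie names in `main` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class CookieHeader {
    // Join cookies as "name=value; name2=value2", the format of a Cookie request header.
    static String toHeaderValue(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Hypothetical cookie names and values, for illustration only.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("ASPSESSIONID", "abc123");
        cookies.put("lang", "en");
        System.out.println("Cookie: " + toHeaderValue(cookies));
    }
}
```

With JSoup you do not need to build this header yourself; passing the map to `.cookies(...)` takes care of it.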

Then make the next request with those cookies:

    Connection.Response flights = Jsoup.connect("https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp")
            .userAgent(userAgent)
            .headers(headers)
            .data(postData)
            .cookies(initialCookies)
            .method(Connection.Method.POST)
            .execute();

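Here `postData` and `headers` are defined as follows (they have to be declared before the requests above, and `userAgent` is a String holding whatever browser User-Agent you want to send):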

    HashMap<String, String> postData = new HashMap<String, String>();
    HashMap<String, String> headers = new HashMap<String, String>();

    headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
    headers.put("Accept-Encoding", "gzip, deflate, br");
    headers.put("Accept-Language", "en-US,en;q=0.9");
    headers.put("Cache-Control", "no-cache");
    headers.put("DNT", "1");
    headers.put("Pragma", "no-cache");
    headers.put("Upgrade-Insecure-Requests", "1");

    postData.put("qtype", "cpi");
    postData.put("sfw", "/FV/FlightTracker/Main");
    postData.put("namdep", "DFW Dallas, TX (Dallas/Ft Worth) - Dallas Fort Worth International");
    postData.put("depap", "DFW");
    postData.put("namarr", "JFK New York, NY (Kennedy) - John F Kennedy International");
    postData.put("arrap", "JFK");
    postData.put("namal2", "Enter name or code");
    postData.put("al", "");
    postData.put("whenArrDep", "dep");
    postData.put("whenHour", "all");
    postData.put("whenDate", "20180321");
    postData.put("input", "Track Flight");
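
The `whenDate` value above looks like a date in `yyyyMMdd` form (21 March 2018). Assuming that is what the form expects, which the answer does not state explicitly, a small sketch for producing it from a `LocalDate` instead of hard-coding the string could look like this. The class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class WhenDate {
    // Assumption: the whenDate form field is a yyyyMMdd string such as "20180321".
    static String format(LocalDate date) {
        return date.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDate.of(2018, 3, 21))); // prints 20180321
    }
}
```

You could then write `postData.put("whenDate", WhenDate.format(chosenDay))`, where `chosenDay` is the day the user picked.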

Now that you have the data, you can parse it and print whatever you need:

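    String page = flights.body();
    System.out.println(page);  // dump the raw HTML for inspection
    Document doc = Jsoup.parse(page);
    // Each flight is a table row carrying one of these two alternating CSS classes.
    Elements elems = doc.select("tr.FlightTrackerListRowOdd, tr.FlightTrackerListRowEven");

    for (Element element : elems) {
        Elements childElems = element.select("td");
        if (childElems.size() < 2) continue;      // skip rows without both cells
        String text1 = childElems.get(0).text();  // airline name
        String text2 = childElems.get(1).text();  // flight number
        System.out.println(text1 + " " + text2);
    }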

The output of the loop is:

Aeroflot Airlines 3453
Aeroflot Airlines 3455
AeroMexico 4966
AeroMexico 4935
Air France 2535
Alitalia 3403
American Airlines 1294
British Airways 1880
China Eastern Airlines 8804
Delta Air Lines 3869
Delta Air Lines 3789
Etihad Airways 3040
Finnair 5726
Gulf Air 4139
Iberia Airlines 4043
Jet Airways 7692
KLM Royal Dutch Airlines 6597
KLM Royal Dutch Airlines 8117
Korean Air 7326
Malaysia Airlines 9442
Qatar Airways 5107
TAM Brazilian Airlines 8379
Virgin Atlantic 4620
Virgin Atlantic 3471

The rest you can adapt to your own needs. This is just an example to show how it can be done.

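Answer 1 (score: 6)

Open the developer tools in your browser, fill in the departure and destination in the search box, and submit the form.

Then, if you inspect the requests the browser sent to the server, you will find that a POST request containing the form data you just submitted was sent to https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp

If you want to scrape this data, you can use the Python requests module to send a POST request to this URL.

Note: since you are using Java, you can still send a simple POST request. You can see how to send a POST request here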
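
As a rough illustration of sending such a POST from Java without extra libraries, here is a minimal sketch using `java.net.http.HttpClient` (Java 11+). The class and method names are invented for this example. It sends only three of the form fields listed in the other answer, and it assumes the server accepts a plain form-encoded body; the real site may also require the remaining fields, the cookies, or specific headers.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class FormPost {
    // Encode key/value pairs as an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build (but do not send) a POST request carrying the encoded form body.
    static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // A subset of the form fields, chosen for brevity (an assumption, not a verified minimum).
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("depap", "DFW");
        fields.put("arrap", "JFK");
        fields.put("whenDate", "20180321");

        HttpRequest request = buildPost(
                "https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp", fields);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

The response body is the same HTML the browser would receive, so it still has to be parsed, for instance with an HTML parser such as JSoup.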