Crawling through JavaScript redirects

Date: 2017-04-06 18:08:18

Tags: javascript java web-crawler url-redirection

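I am writing a spider in Java and I am having trouble handling URL redirects. So far I have run into two kinds of redirect. The first is signaled by an HTTP 3xx response code, which I can handle by following this answer.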

In the second kind, however, the server returns HTTP response code 200 and the page contains nothing but some JavaScript code:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
function detectmob() { 
    var u=(document.URL);
    if( navigator.userAgent.match(/Android/i) || some other browser...){
        window.location.href="web/mobile/index.php";
    } else {
        window.location.href="web/desktop/index.php";
    }
}

detectmob();
</script>
</head>
<body></body></html>

If the original URL is http://example.com, a desktop web browser with JavaScript enabled is automatically redirected to http://example.com/web/desktop/index.php
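The redirect targets in the script are relative ("web/desktop/index.php"), so a spider has to resolve them against the URL of the page it just fetched. A minimal sketch of that step, using only java.net.URI, follows. The class and method names (RedirectTargetResolver, resolve) are hypothetical and do not appear in the question or the answer.

```java
import java.net.URI;

final class RedirectTargetResolver {

    /**
     * Resolves a redirect target found in a page against the URL of that page.
     * Illustrative helper; not part of the original question or answer.
     */
    static String resolve(String pageUrl, String target) {
        URI base = URI.create(pageUrl);
        // java.net.URI resolves "web/x" against "http://example.com" (empty path)
        // without inserting a slash, so give the base an explicit root path first.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        // Absolute targets are returned unchanged; relative ones are merged with the base.
        return base.resolve(target).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com", "web/desktop/index.php"));
        System.out.println(resolve("http://example.com/a/page.html", "/s/index.html"));
    }
}
```

Resolving against the page URL, rather than concatenating strings, also covers targets that start with "/" or that are already absolute.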

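However, my spider decides whether it has reached the final URL by checking HttpURLConnection#getResponseCode(): an HTTP response code of 200 means it has, and if a 3xx code is received it reads the Location field with URLConnection#getHeaderField(). Here is the relevant snippet from my spider: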

public String getFinalUrl(String originalUrl) {
    try {
        URLConnection con = new URL(originalUrl).openConnection();
        HttpURLConnection hCon = (HttpURLConnection) con;
        // Disable automatic redirect handling so each hop can be inspected.
        hCon.setInstanceFollowRedirects(false);
        int code = hCon.getResponseCode();
        if (code == HttpURLConnection.HTTP_MOVED_PERM
                || code == HttpURLConnection.HTTP_MOVED_TEMP) {
            String location = hCon.getHeaderField("Location");
            System.out.println("redirected url: " + location);
            // Follow the redirect chain recursively.
            return getFinalUrl(location);
        }
    } catch (IOException ex) {
        System.err.println(ex.toString());
    }

    // Any other status (including 200) is treated as the final URL.
    return originalUrl;
}
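The snippet above only follows 301 and 302, recurses without a limit, and assumes the Location header is an absolute URL. A hedged sketch of a more defensive loop is shown here. It is not the asker's code: RedirectChain, Response, MAX_HOPS and the fetch function are hypothetical names, and the fetch step is passed in as a function so the logic can be exercised without a network. In a real spider that function would wrap HttpURLConnection with setInstanceFollowRedirects(false).

```java
import java.net.URI;
import java.util.function.Function;

final class RedirectChain {

    /** Upper bound on hops so a redirect loop cannot run forever (assumed value). */
    static final int MAX_HOPS = 10;

    /** Minimal view of a response: status code plus Location header (null if absent). */
    static final class Response {
        final int status;
        final String location;

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** True for the 3xx status codes that carry a Location to follow. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Follows HTTP-level redirects from startUrl and returns the last URL reached. */
    static String follow(String startUrl, Function<String, Response> fetch) {
        String current = startUrl;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            Response response = fetch.apply(current);
            if (response == null || !isRedirect(response.status) || response.location == null) {
                return current;
            }
            URI base = URI.create(current);
            // Give an empty-path base ("http://example.com") a root path before resolving.
            if (base.getPath() == null || base.getPath().isEmpty()) {
                base = base.resolve("/");
            }
            // Location may be relative; resolve it against the URL that produced it.
            current = base.resolve(response.location).toString();
        }
        return current; // hop limit reached
    }
}
```

This loop still stops at the first 200 response, so the JavaScript redirect described in the question is not detected by it.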

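As a result, the page above returns HTTP response code 200, so my spider assumes there are no further redirects and starts parsing a page that has no text content at all.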

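I googled the problem a bit, and apparently javax.script is somehow relevant, but I don't know how to make it work. How can I program my spider so that it gets the correct URL?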

1 answer:

Answer 0: (score: 0)

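Here is a solution that uses Apache HttpClient to handle response-code redirects, Jsoup to extract the JavaScript from the HTML, and regular expressions to pull the redirect target out of the different ways a redirect can be written in JavaScript.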

package com.yourpackage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import com.google.common.base.Joiner;
import com.google.common.net.HttpHeaders;

public class CrawlHelper {

  /**
   * Get end contents of a urlString. Status code is not checked here because
   * org.apache.http.client.HttpClient effectively handles the 301 redirects.
   * 
   * Javascript is extracted using Jsoup, and checked for references to
   * "window.location.replace".
   *
   * @param urlString Url. "http" will be prepended if https or http not already there.
   * @return Result after all redirects, including javascript.
   * @throws IOException
   */
  public String getResult(final String urlString) throws IOException {
    String html = getTextFromUrl(urlString);
    Document doc = Jsoup.parse(html);
    for (Element script : doc.select("script")) {
      String potentialURL = getTargetLocationFromScript(urlString, script.html());
      if (potentialURL.indexOf("/") == 0) {
        potentialURL = Joiner.on("").join(urlString, potentialURL);
      }
      if (!StringUtil.isBlank(potentialURL)) {
        return getTextFromUrl(potentialURL);
      }
    }
    return html;
  }

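  /**
   * 
   * @param urlString Base URL, prepended if the target location doesn't start with "http".
   * @param js Javascript to scan.
   * @return Target that matches window.location.replace or window.location.href assignments,
   *         or an empty string if the script contains no such redirect.
   * @throws IOException
   */
  String getTargetLocationFromScript(String urlString, String js) throws IOException {
    String potentialURL = getTargetLocationFromScript(js);
    // No redirect in this script: return blank so the caller moves on to the
    // next script instead of re-fetching the original URL.
    if (StringUtil.isBlank(potentialURL) || potentialURL.indexOf("http") == 0) {
      return potentialURL;
    }
    // Relative target: make sure a "/" separates the base URL and the path.
    String separator = (urlString.endsWith("/") || potentialURL.startsWith("/")) ? "" : "/";
    return Joiner.on("").join(urlString, separator, potentialURL);
  }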

  String getTargetLocationFromScript(String js) throws IOException {
    int i = js.indexOf("window.location.replace");
    if (i > -1) {
      return getTargetLocationFromLocationReplace(js);
    }
    i = js.indexOf("window.location.href");    
    if (i > -1) {
      return getTargetLocationFromHrefAssign(js);
    }
    return "";
  }

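  private String getTargetLocationFromHrefAssign(String js) {
    // Capture only up to the closing quote, so that two assignments in the
    // same script are not merged into a single match.
    return findTargetFrom("window\\.location\\.href\\s*=\\s*[\"']([^\"']+)[\"']", js);
  }

  private String getTargetLocationFromLocationReplace(String js) throws IOException {
    return findTargetFrom("window\\.location\\.replace\\(\\s*[\"']([^\"']+)[\"']\\s*\\)", js);
  }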

  private String findTargetFrom(String regex, String js) {
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(js);
    while (m.find()) {
      String potentialURL = m.group(1);
      if (!StringUtil.isBlank(potentialURL)) {
        return potentialURL;
      }
    }
    return "";
  }

  private String getTextFromUrl(String urlString) throws IOException {
    if (StringUtil.isBlank(urlString)) {
      throw new IOException("Supplied URL value is empty.");
    }
    String httpUrlString = prependHTTPifNecessary(urlString);
    HttpClient client = HttpClientBuilder.create().build();
    HttpGet request = new HttpGet(httpUrlString);
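    // HttpHeaders.USER_AGENT is the header name ("User-Agent"), not a value.
    // "Mozilla/5.0" is only a placeholder; substitute your crawler's own identifier.
    request.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0");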
    HttpResponse response = client.execute(request);
    try (BufferedReader rd =
        new BufferedReader(new InputStreamReader(response.getEntity().getContent()))) {
      StringWriter result = new StringWriter();
      String line = "";
      while ((line = rd.readLine()) != null) {
        result.append(line);
      }
      return result.toString();
    }
  }

  private String prependHTTPifNecessary(String urlString) throws IOException {
    if (urlString.indexOf("http") != 0) {
      return Joiner.on("://").join("http", urlString);
    }
    return validateURL(urlString);
  }

  private String validateURL(String urlString) throws IOException {
    try {
      new URL(urlString);
    } catch (MalformedURLException mue) {
      throw new IOException(mue);
    }
    return urlString;
  }
}

TDD ... modify/enhance the tests to match your various scenarios:

package com.yourpackage;

import java.io.IOException;

import org.junit.Assert;
import org.junit.Test;

public class CrawlHelperTest {

  @Test
  public void testRegex() throws IOException {
    String targetLoc = 
    new CrawlHelper().getTargetLocationFromScript("somesite.com", "function goHome() { window.location.replace(\"/s/index.html\")}");
    Assert.assertEquals("somesite.com/s/index.html", targetLoc);
    targetLoc = 
        new CrawlHelper().getTargetLocationFromScript("window.location.href=\"web/mobile/index.php\";");
    Assert.assertEquals("web/mobile/index.php", targetLoc);
  }

  @Test
  public void testCrawl() throws IOException {
    Assert.assertTrue(new CrawlHelper().getResult("somesite.com").indexOf("someExpectedContent") > -1);
  }

}
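One scenario worth testing is the page from the question itself: its script contains two window.location.href assignments (mobile and desktop), and a first-match search returns only the mobile one. A small sketch that collects every candidate target, leaving the choice to the caller, might look like this. JsRedirectTargets and find are hypothetical names, the pattern is an assumption about how such redirects are commonly written, and a regex cannot evaluate the script's branching logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsRedirectTargets {

    // Alternative 1: location = "..." or location.href = "..."
    // Alternative 2: location.replace("...") or location.assign("...")
    // Both also match when prefixed with "window." or "document.".
    private static final Pattern REDIRECT = Pattern.compile(
            "\\blocation(?:\\.href)?\\s*=\\s*[\"']([^\"']+)[\"']"
            + "|\\blocation\\.(?:replace|assign)\\(\\s*[\"']([^\"']+)[\"']\\s*\\)");

    /** Returns all quoted redirect targets in the script, in order of appearance. */
    static List<String> find(String js) {
        List<String> targets = new ArrayList<>();
        Matcher m = REDIRECT.matcher(js);
        while (m.find()) {
            // Exactly one of the two capture groups is set for each match.
            targets.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return targets;
    }
}
```

A desktop crawler could then crawl each returned target, or pick one by a site-specific rule.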