I am writing a web spider in Java and I am having trouble handling URL redirects. So far I have come across two kinds of redirect. The first kind is an HTTP 3xx response code, which I can handle following this answer.
The second kind, however, is where the server returns HTTP response code 200 with a page containing nothing but some JavaScript:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
function detectmob() {
    var u = (document.URL);
    if (navigator.userAgent.match(/Android/i) || some other browser...) {
        window.location.href = "web/mobile/index.php";
    } else {
        window.location.href = "web/desktop/index.php";
    }
}
detectmob();
</script>
</head>
<body></body></html>
If the original URL is http://example.com and I visit it with a desktop browser that has JavaScript enabled, I am automatically redirected to http://example.com/web/desktop/index.php.
My spider, however, decides whether it has reached the final URL by checking HttpURLConnection#getResponseCode(): it treats HTTP response code 200 as final, and when it receives an HTTP 3xx response code it reads the Location field via URLConnection#getHeaderField(). Here is the relevant snippet from my spider:
public String getFinalUrl(String originalUrl) {
    try {
        URLConnection con = new URL(originalUrl).openConnection();
        HttpURLConnection hCon = (HttpURLConnection) con;
        hCon.setInstanceFollowRedirects(false);
        if (hCon.getResponseCode() == HttpURLConnection.HTTP_MOVED_PERM
                || hCon.getResponseCode() == HttpURLConnection.HTTP_MOVED_TEMP) {
            System.out.println("redirected url: " + con.getHeaderField("Location"));
            return getFinalUrl(con.getHeaderField("Location"));
        }
    } catch (IOException ex) {
        System.err.println(ex.toString());
    }
    return originalUrl;
}
A page like the one above therefore returns HTTP response code 200, so my spider assumes there are no further redirects and starts parsing a page whose text content is empty.
I have googled the problem a bit, and apparently javax.script is somehow relevant, but I cannot figure out how to make it work. How can I program my spider so that it obtains the correct URL?
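As an aside: in cases like this the redirect target can often be recovered without executing the JavaScript at all, by pattern-matching the `window.location.href` assignment in the returned HTML. A minimal, stdlib-only sketch (the class name and pattern are illustrative, not from any library):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RedirectSniffer {
    // Matches window.location.href = "..." assignments in inline scripts.
    private static final Pattern JS_REDIRECT =
            Pattern.compile("window\\.location\\.href\\s*=\\s*\"([^\"]+)\"");

    // Returns the first redirect target found in the HTML, or null if none.
    public static String findJsRedirect(String html) {
        Matcher m = JS_REDIRECT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
                "<script>window.location.href=\"web/desktop/index.php\";</script>";
        System.out.println(findJsRedirect(html)); // web/desktop/index.php
    }
}
```

This is fragile by nature (it only catches this one assignment style), but it avoids pulling in a JavaScript engine for a simple device-detection redirect.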
Answer 0 (score: 0)
Here is a solution that uses Apache HttpClient to handle the response-code redirects, Jsoup to extract the JavaScript from the HTML, and a regular expression to pull the redirect target string out of the JavaScript.
package com.yourpackage;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import com.google.common.base.Joiner;
import com.google.common.net.HttpHeaders;

public class CrawlHelper {

    /**
     * Get end contents of a urlString. Status code is not checked here because
     * org.apache.http.client.HttpClient effectively handles the 301 redirects.
     *
     * Javascript is extracted using Jsoup, and checked for references to
     * "window.location.replace" or "window.location.href" assignments.
     *
     * @param urlString Url. "http" will be prepended if https or http not already there.
     * @return Result after all redirects, including javascript.
     * @throws IOException
     */
    public String getResult(final String urlString) throws IOException {
        String html = getTextFromUrl(urlString);
        Document doc = Jsoup.parse(html);
        for (Element script : doc.select("script")) {
            String potentialURL = getTargetLocationFromScript(urlString, script.html());
            if (StringUtil.isBlank(potentialURL)) {
                continue; // no redirect in this script block
            }
            if (potentialURL.indexOf("/") == 0) {
                potentialURL = Joiner.on("").join(urlString, potentialURL);
            }
            return getTextFromUrl(potentialURL);
        }
        return html;
    }

    /**
     * @param urlString Will be prepended if the target location doesn't start with "http".
     * @param js Javascript to scan.
     * @return Target that matches window.location.replace or window.location.href assignments.
     * @throws IOException
     */
    String getTargetLocationFromScript(String urlString, String js) throws IOException {
        String potentialURL = getTargetLocationFromScript(js);
        if (StringUtil.isBlank(potentialURL) || potentialURL.indexOf("http") == 0) {
            return potentialURL;
        }
        return Joiner.on("").join(urlString, potentialURL);
    }

    String getTargetLocationFromScript(String js) throws IOException {
        int i = js.indexOf("window.location.replace");
        if (i > -1) {
            return getTargetLocationFromLocationReplace(js);
        }
        i = js.indexOf("window.location.href");
        if (i > -1) {
            return getTargetLocationFromHrefAssign(js);
        }
        return "";
    }

    private String getTargetLocationFromHrefAssign(String js) {
        return findTargetFrom("window.location.href\\s?=\\s?\\\"(.+)\\\"", js);
    }

    private String getTargetLocationFromLocationReplace(String js) {
        return findTargetFrom("window.location.replace\\(\\\"(.+)\\\"\\)", js);
    }

    private String findTargetFrom(String regex, String js) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(js);
        while (m.find()) {
            String potentialURL = m.group(1);
            if (!StringUtil.isBlank(potentialURL)) {
                return potentialURL;
            }
        }
        return "";
    }

    private String getTextFromUrl(String urlString) throws IOException {
        if (StringUtil.isBlank(urlString)) {
            throw new IOException("Supplied URL value is empty.");
        }
        String httpUrlString = prependHTTPifNecessary(urlString);
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(httpUrlString);
        // HttpHeaders.USER_AGENT is the header *name*; the value should be a real UA string.
        request.addHeader(HttpHeaders.USER_AGENT, "Mozilla/5.0");
        HttpResponse response = client.execute(request);
        try (BufferedReader rd =
                new BufferedReader(new InputStreamReader(response.getEntity().getContent()))) {
            StringWriter result = new StringWriter();
            String line;
            while ((line = rd.readLine()) != null) {
                result.append(line);
            }
            return result.toString();
        }
    }

    private String prependHTTPifNecessary(String urlString) throws IOException {
        if (urlString.indexOf("http") != 0) {
            return Joiner.on("://").join("http", urlString);
        }
        return validateURL(urlString);
    }

    private String validateURL(String urlString) throws IOException {
        try {
            new URL(urlString);
        } catch (MalformedURLException mue) {
            throw new IOException(mue);
        }
        return urlString;
    }
}
TDD... modify and extend these to match your own scenarios:
package com.yourpackage;

import java.io.IOException;

import org.junit.Assert;
import org.junit.Test;

public class CrawlHelperTest {

    @Test
    public void testRegex() throws IOException {
        String targetLoc = new CrawlHelper().getTargetLocationFromScript(
                "somesite.com", "function goHome() { window.location.replace(\"/s/index.html\")}");
        Assert.assertEquals("somesite.com/s/index.html", targetLoc);
        targetLoc = new CrawlHelper()
                .getTargetLocationFromScript("window.location.href=\"web/mobile/index.php\";");
        Assert.assertEquals("web/mobile/index.php", targetLoc);
    }

    @Test
    public void testCrawl() throws IOException {
        Assert.assertTrue(
                new CrawlHelper().getResult("somesite.com").indexOf("someExpectedContent") > -1);
    }
}
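One caveat about the answer above: joining the base URL and a relative path with plain string concatenation can produce malformed results (for example, a missing slash between host and path). The two-argument java.net.URL constructor resolves a relative reference against a base URL for you; a small stdlib-only sketch (URLs are illustrative):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveExample {
    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://example.com/");
        // The two-argument constructor resolves the relative reference against the base.
        URL mobile = new URL(base, "web/mobile/index.php");
        URL rooted = new URL(base, "/s/index.html");
        System.out.println(mobile); // http://example.com/web/mobile/index.php
        System.out.println(rooted); // http://example.com/s/index.html
    }
}
```

Swapping this in for the Joiner-based concatenation would also handle targets like "../index.php" correctly.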