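I need to capture every host that a web page requests resources from while it loads. At the moment I am using Selenium with PhantomJSDriver and BrowserMob Proxy to generate a HAR file. Once the HAR has been generated, I can parse from its log all the HTTP requests the page made during loading: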
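    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.Set;

    import net.lightbody.bmp.BrowserMobProxy;
    import net.lightbody.bmp.BrowserMobProxyServer;
    import net.lightbody.bmp.client.ClientUtil;
    import net.lightbody.bmp.core.har.Har;
    import net.lightbody.bmp.core.har.HarEntry;
    import net.lightbody.bmp.core.har.HarNameValuePair;
    import net.lightbody.bmp.proxy.CaptureType;
    import org.openqa.selenium.Proxy;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriverService;
    import org.openqa.selenium.remote.CapabilityType;
    import org.openqa.selenium.remote.DesiredCapabilities;

    // The class name is a placeholder; the original snippet showed only main().
    public class HostCapture {
        public static void main(String[] args) throws IOException {
            // BrowserMobProxy: start on a free port and enable HAR capture
            BrowserMobProxy server = new BrowserMobProxyServer();
            server.start(0);
            server.setHarCaptureTypes(CaptureType.getAllContentCaptureTypes());
            server.enableHarCaptureTypes(CaptureType.RESPONSE_COOKIES, CaptureType.REQUEST_COOKIES,
                    CaptureType.REQUEST_HEADERS, CaptureType.RESPONSE_HEADERS, CaptureType.REQUEST_CONTENT,
                    CaptureType.RESPONSE_CONTENT);
            Proxy seleniumProxy = ClientUtil.createSeleniumProxy(server);

            // PHANTOMJS_CLI_ARGS: route PhantomJS traffic through the proxy
            ArrayList<String> cliArgsCap = new ArrayList<>();
            cliArgsCap.add("--proxy=localhost:" + server.getPort());
            cliArgsCap.add("--ignore-ssl-errors=yes");

            // DesiredCapabilities
            DesiredCapabilities capabilities = new DesiredCapabilities();
            capabilities.setCapability(CapabilityType.PROXY, seleniumProxy);
            capabilities.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
            capabilities.setCapability(CapabilityType.SUPPORTS_JAVASCRIPT, true);
            capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, cliArgsCap);
            capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
                    "...");

            // Connect to the website using WebDriver
            WebDriver driver = new PhantomJSDriver(capabilities);

            // Start a new HAR, then load the page so its requests are recorded
            server.newHar();
            String site = "...";
            driver.get(site);

            // Parse the HAR: collect the Host header of every request to another host
            Set<String> hosts = new HashSet<>();
            String siteHost = new URL(site).getHost();  // computed once, not per entry
            Har har = server.getHar();
            for (HarEntry entry : har.getLog().getEntries()) {
                // Skip requests whose URL contains the page's own host
                if (entry.getRequest().getUrl().contains(siteHost)) {
                    continue;
                }
                for (HarNameValuePair h : entry.getRequest().getHeaders()) {
                    // Header names are case-insensitive; Set.add already ignores duplicates
                    if ("Host".equalsIgnoreCase(h.getName())) {
                        hosts.add(h.getValue());
                    }
                }
            }

            // quit() ends the whole PhantomJS session (close() only closes the window);
            // shut the browser down before stopping the proxy it talks through
            driver.quit();
            server.stop();
        }
    }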
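The host-filtering step in the loop above does not depend on Selenium or BMP, so it can be sketched and tested on its own. The following is only a minimal sketch under a few assumptions: the class name `ThirdPartyHosts`, the method `collect`, and the example URLs are made up for illustration, and it takes the host from each request URL with `java.net.URI` rather than reading the `Host` header. Unlike the substring check above, it compares hosts exactly, so a third-party URL that merely mentions the site in its query string is still counted.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of BrowserMob Proxy or Selenium.
final class ThirdPartyHosts {

    // Returns the distinct hosts in requestUrls that differ from the host of pageUrl.
    static Set<String> collect(String pageUrl, List<String> requestUrls) {
        String pageHost = URI.create(pageUrl).getHost();
        Set<String> hosts = new TreeSet<>();  // sorted, duplicates ignored
        for (String url : requestUrls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue;  // skip URLs that do not parse
            }
            // getHost() is null for URIs without an authority, e.g. data: URIs
            if (host != null && !host.equalsIgnoreCase(pageHost)) {
                hosts.add(host.toLowerCase(Locale.ROOT));
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "https://example.com/style.css",
                "https://cdn.example.net/lib.js",
                "https://ads.example.org/pixel.gif?ref=example.com",
                "https://cdn.example.net/font.woff");
        // prints [ads.example.org, cdn.example.net]
        System.out.println(collect("https://example.com/index.html", urls));
    }
}
```

With a HAR in hand, the list of request URLs would come from `entry.getRequest().getUrl()` for each entry, as in the loop above.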
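What bothers me is that Selenium WebDriver is very slow and uses a lot of memory. Unfortunately, I am not very experienced with Selenium and BMP (or with web development in general). Is there a way to generate a HAR file with BMP without using Selenium? Or is there a better way to get the information I need? Thanks in advance.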