I want to build a simple web crawler in Java. I am trying this code:
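import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

// Default HtmlUnitDriver: JavaScript is disabled
WebDriver driver = new HtmlUnitDriver();
driver.get("https://codereview.qt-project.org/#change,70");
String pageSource = driver.getPageSource();
System.out.println(pageSource);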
This is the page source I get back:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<html><head><META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Gerrit Code Review</title><meta content="locale=en_US" name="gwt:property">
<script language="javascript" type="text/javascript">var gerrit_hostpagedata={"config":
{"useContributorAgreements":true,"useContactInfo":false,"allowRegisterNewEmail":false,
However, the page content is generated by JavaScript, and I want an HTML snapshot of the rendered page.
Answer 0 (score: 1)
Create the driver with JavaScript enabled:
// Passing true enables JavaScript support in HtmlUnitDriver
WebDriver driver = new HtmlUnitDriver(true);
Result:
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>
codereview.qt-project Code Review
</title>
<meta content="locale=en_US" name="gwt:property"/>
<script language="javascript" type="text/javascript">
//<![CDATA[
var gerrit_hostpagedata={"config":{"useContributorAgreements":true,"useContactInfo":false,"allowRegisterNewEmail":false,"authType":"HTTP","downloadSchemes":["DEFAULT_DOWNLOADS"],"sshdAddress":"*:29418","wildProject":{"name":"All-Projects"},"approvalTypes":{"approvalTypes":[{"category":{"categoryId":{"id":"CRVW"},"name":"Code Review","abbreviatedName":"R","position":1,"functionName":"MaxWithBlock","copyMinScore":true,"labelName":"Code-Review"},"values":[{"key":{"categoryId":{"id":"CRVW"},"value":-2},"name":"This shall not be merged"},{"key":{"categoryId":{"id":"CRVW"},"value":-1},"name":"I would prefer this is not merged as is"},{"key":{"categoryId":{"id":"CRVW"},"value":0},"name":"No score"},{"key":{"categoryId":{"id":"CRVW"},"value":1},"name":"Looks good to me, but someone else must approve"},{"key":{"categoryId":{"id":"CRVW"},"value":2},"name":"Looks good to me, approved"}],"maxNegative":-2,"maxPositive":2},{"category":{"categoryId":{"id":"SRVW"},"name":"Sanity Review","abbreviatedName":"S","position":2,"functionName":"MaxWithBlock","copyMinScore":false,"labelName":"Sanity-Review"},"values":[{"key":{"categoryId":{"id":"SRVW"},"value":-2},"name":"Major sanity problems found"},{"key":{"categoryId":{"id":"SRVW"},"value":-1},"name":"Sanity problems found"},{"key":{"categoryId":{"id":"SRVW"},"value":0},"name":"No sanity review "},{"key":{"categoryId":{"id":"SRVW"},"value":1},"name":"Sanity review passed"}],"maxNegative":-2,"maxPositive":1}]},"editableAccountFields":["REGISTER_NEW_EMAIL","USER_NAME","FULL_NAME"],"commentLinks":[{"find":"[Tt]ask-number:\\s+([\\w\\-]+)","replace":"\u003ca href\u003d\"http://bugreports.qt-project.org/browse/$1\"\u003e$\u0026\u003c/a\u003e"}],"documentationAvailable":false}};gerrit_hostpagedata.theme={"backgroundColor":"#FCFEEF","topMenuColor":"#44A51C","textColor":"#000000","trimColor":"#B6DCA6","selectionColor":"#FFFFCC"};
//]]>
</script>
<style type="text/css">
#gerrit_topmenu {
color: #ffffff;
}
#gerrit_topmenu .gwt-Label {
color: #ffffff;
}
#gerrit_topmenu .gwt-TabBarItem-selected .gwt-Label {
color: #000000;
}
#gerrit_topmenu a, #gerrit_topmenu a:visited, #gerrit_topmenu a:hover {
color: #ffffff;
}
#qt-footer-links {
background-color: #44A51C;
}
#qt-footer-links ul {
width: 100%;
margin: 0;
text-align: center;
padding: .1em 0 .3em 0;
}
#qt-footer-links li {
display: inline;
padding: .1em 1em;
}
#qt-footer-links a, #qt-footer-links a:visited, #qt-footer-links a:hover {
font-family: Arial;
color: white;
font-size: 11px;
font-weight: bold;
text-decoration: none;
}
</style>
<link href="favicon.ico" rel="icon" type="image/gif"/>
<link href="gerrit/gwt/chrome/30B802F72484AED7E67C91FE77CD50BD.cache.css" rel="stylesheet"/>
<link href="undefined" rel="stylesheet"/>
</head>
<body>
<div id="gerrit_topmenu" class="GCLMTUVDNF">
<table class="GCLMTUVDIK">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<td class="GCLMTUVDMK">
<table cellspacing="0" cellpadding="0" class="GCLMTUVDJK">
<tbody>
<tr>
<td align="left" style="vertical-align: top;">
<table cellspacing="0" cellpadding="0" class="gwt-TabBar" role="tablist" style="width: 100%;">
<tbody>
<tr>
<td align="left" style="vertical-align: bottom;" height="100%" class="gwt-TabBarFirst-wrapper">
<div class="gwt-TabBarFirst" style="white-space: normal; height: 100%;">
</div>
</td>
<td align="left" style="vertical-align: bottom;" class="gwt-TabBarItem-wrapper gwt-TabBarItem-wrapper-selected">
<div tabindex="0" class="gwt-TabBarItem gwt-TabBarItem-selected" role="tab">
<div class="gwt-Label" style="white-space: nowrap;">
All
</div>
</div>
</td>
<td align="left" style="vertical-align: bottom;" width="100%" class="gwt-TabBarRest-wrapper">
<div class="gwt-TabBarRest" style="white-space: normal; height: 100%;">
</div>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td align="left" style="vertical-align: top;" height="100%">
<div class="gwt-TabPanelBottom" role="tabpanel">
<div style="width: 100%; height: 100%; padding: 0px; margin: 0px;">
<div class="GCLMTUVDMG" role="menubar" style="width: 100%; height: 100%;">
<a class="GCLMTUVDPG GCLMTUVDNG" href="#q,status:open,n,z" role="menuitem">
Open
</a>
<a class="GCLMTUVDPG GCLMTUVDNG" href="#q,status:staged,n,z" role="menuitem">
Staged
</a>
<a class="GCLMTUVDPG GCLMTUVDNG" href="#q,status:integrating,n,z" role="menuitem">
Integrating
</a>
<a class="GCLMTUVDPG GCLMTUVDNG" href="#q,status:merged,n,z" role="menuitem">
Merged
</a>
<a class="GCLMTUVDPG GCLMTUVDNG" href="#q,status:deferred,n,z" role="menuitem">
Deferred
</a>
<a class="GCLMTUVDPG" href="#q,status:abandoned,n,z" role="menuitem">
Abandoned
</a>
</div>
</div>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td class="GCLMTUVDLK">
<div>
</div>
</td>
<td class="GCLMTUVDMK">
<div class="GCLMTUVDKK">
<div class="GCLMTUVDMG" role="menubar">
<a class="GCLMTUVDPG" href="javascript:;" role="menuitem">
Sign In
</a>
</div>
<div class="GCLMTUVDJJ">
<input type="text" class="gwt-TextBox GCLMTUVDHG" value="Change #, SHA-1, tr:id, owner:email or reviewer:email"/>
<button type="button" class="gwt-Button">
Search
</button>
</div>
</div>
</td>
</tr>
</tbody>
</table>
<div class="GCLMTUVDGJ">
<span class="GCLMTUVDEJ GCLMTUVDFJ" style="">
Loading ...
</span>
</div>
</div>
<div id="gerrit_header">
<div>
<img src="static/logo_open_gov.png" style="margin: 18px 0 0 10px;"/>
<img src="static/logo_qt.png" style="float: right; margin: 18px 28px 0 0;"/>
</div>
</div>
<div id="gerrit_body" class="GCLMTUVDMF">
<div>
<div style="display: none;">
<div class="GCLMTUVDHJ GCLMTUVDLB">
<div class="GCLMTUVDIJ">
<span class="gwt-InlineLabel">
</span>
</div>
<div>
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td align="left" style="vertical-align: top;">
<table class="GCLMTUVDFG GCLMTUVDKB">
<colgroup>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<td class="header GCLMTUVDNK">
Change-Id:
</td>
<td class="GCLMTUVDNK GCLMTUVDBC">
</td>
</tr>
<tr>
<td class="header">
Owner
</td>
<td>
</td>
</tr>
<tr>
<td class="header">
Project
</td>
<td>
</td>
</tr>
<tr>
<td class="header">
Branch
</td>
<td>
</td>
</tr>
<tr>
<td class="header">
Topic
</td>
<td>
</td>
</tr>
<tr>
<td class="header">
Uploaded
</td>
<td>
</td>
</tr>
<tr>
<td class="header">
Updated
</td>
<td>
</td>
</tr>
<tr>
<td class="header GCLMTUVDDB">
Status
</td>
<td>
</td>
</tr>
<tr>
<td class="GCLMTUVDHI">
</td>
<td class="GCLMTUVDHI">
</td>
</tr>
</tbody>
</table>
</td>
<td align="left" style="vertical-align: top;">
<div class="GCLMTUVDMB">
</div>
</td>
</tr>
</tbody>
</table>
<div class="GCLMTUVDO">
<table class="GCLMTUVDGG">
<colgroup>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<td class="header">
Reviewer
</td>
<td class="header">
</td>
<td class="header">
Code Review
</td>
<td class="header">
Sanity Review
</td>
<td class="header GCLMTUVDDJ">
</td>
</tr>
</tbody>
</table>
<ul class="GCLMTUVDCH">
</ul>
<div class="GCLMTUVDK" style="display: none;">
<div>
<input type="text" class="gwt-SuggestBox GCLMTUVDHG" value="Name or Email"/>
<button type="button" class="gwt-Button">
Add Reviewer
</button>
</div>
</div>
</div>
<table cellspacing="0" cellpadding="0" class="gwt-DisclosurePanel gwt-DisclosurePanel-closed">
<tbody>
<tr>
<td align="left" style="vertical-align: top;">
<a href="javascript:void(0);" style="display: block;" class="header">
<table>
<tbody>
<tr>
<td align="center" style="width: 16px;">
<img onload="this.__gwtLastUnhandledEvent="load";" src="https://codereview.qt-project.org/gerrit/clear.cache.gif" style="width: 16px; height: 16px; background: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAfklEQVR42mNgoDZITk4WosiAtLS0M6mpqb1Amp9cAy4B8X8gfpWenp5MiQEwfB6IbSgxAIaXArEcJQaA8Ddg+NQVFhZykmsADG8MDQ1lJseA5wQDFocBP0FRm5WVxUNOGGwEJi4VcmLhKtC5HuSkg8NA5+bjDCRCAG8UDUoAAIw8kVdwMG+3AAAAAElFTkSuQmCC) no-repeat 0px 0px" border="0" class="gwt-Image"/>
</td>
<td>
Included in
</td>
</tr>
</tbody>
</table>
</a>
</td>
</tr>
<tr>
<td align="left" style="vertical-align: top;">
<div style="padding: 0px; overflow: hidden; display: none;">
<table class="content">
<colgroup>
<col/>
</colgroup>
<tbody>
<tr>
<td>
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
<table cellspacing="0" cellpadding="0" class="gwt-DisclosurePanel gwt-DisclosurePanel-closed">
<tbody>
<tr>
<td align="left" style="vertical-align: top;">
<a href="javascript:void(0);" style="display: block;" class="header">
<table>
<tbody>
<tr>
<td align="center" style="width: 16px;">
<img onload="this.__gwtLastUnhandledEvent="load";" src="https://codereview.qt-project.org/gerrit/clear.cache.gif" style="width: 16px; height: 16px; background: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAfklEQVR42mNgoDZITk4WosiAtLS0M6mpqb1Amp9cAy4B8X8gfpWenp5MiQEwfB6IbSgxAIaXArEcJQaA8Ddg+NQVFhZykmsADG8MDQ1lJseA5wQDFocBP0FRm5WVxUNOGGwEJi4VcmLhKtC5HuSkg8NA5+bjDCRCAG8UDUoAAIw8kVdwMG+3AAAAAElFTkSuQmCC) no-repeat 0px 0px" border="0" class="gwt-Image"/>
</td>
<td>
Dependencies
</td>
</tr>
</tbody>
</table>
</a>
</td>
</tr>
<tr>
<td align="left" style="vertical-align: top;">
<div style="padding: 0px; overflow: hidden; display: none;">
<table class="GCLMTUVDOB content" style="width: auto;">
<colgroup>
<col/>
</colgroup>
<tbody>
<tr>
<td class="GCLMTUVDDG"/>
<td class="GCLMTUVDDG"/>
<td class="GCLMTUVDFB GCLMTUVDKD">
ID
</td>
<td class="GCLMTUVDKD">
Subject
</td>
<td class="GCLMTUVDKD">
Owner
</td>
<td class="GCLMTUVDKD">
Project
</td>
<td class="GCLMTUVDKD">
Branch
</td>
<td class="GCLMTUVDKD">
Updated
</td>
</tr>
<tr>
<td colspan="8" class="GCLMTUVDKJ">
Depends On
</td>
</tr>
<tr>
<td colspan="8" class="GCLMTUVDOE">
(None)
</td>
</tr>
<tr>
<td colspan="8" class="GCLMTUVDKJ">
Needed By
</td>
</tr>
<tr>
<td colspan="8" class="GCLMTUVDOE">
(None)
</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
<table class="GCLMTUVDLJ">
<colgroup>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<td>
Old Version History:
</td>
<td>
<select class="gwt-ListBox">
<option value="Base" selected="selected">
Base
</option>
</select>
</td>
</tr>
</tbody>
</table>
<div>
</div>
<div class="GCLMTUVDJB">
</div>
</div>
</div>
</div>
</div>
</div>
<div style="clear: both; margin-top: 15px; padding-top: 2px; margin-bottom: 15px;">
<div id="gerrit_footer">
<div>
<div id="qt-footer-links">
<ul>
<li>
<a href="http://qt.digia.com/">
qt.digia.com
</a>
</li>
<li>
<a href="http://qt-project.org/doc/">
Qt Documentation
</a>
</li>
<li>
<a href="http://qt-project.org/">
Qt-Project
</a>
</li>
<li>
<a href="http://planet.qt-project.org/">
Planet Qt
</a>
</li>
<li>
<a href="http://qt.gitorious.org/">
Qt Repositories - Gitorious
</a>
</li>
<li>
<a href="http://bugreports.qt-project.org/">
Qt Bug Tracker - JIRA
</a>
</li>
</ul>
</div>
</div>
</div>
<div id="gerrit_btmmenu" style="clear: both;">
<div class="GCLMTUVDIG">
Press '?' to view keyboard shortcuts
</div>
<div class="GCLMTUVDAL">
Powered by
<a href="http://code.google.com/p/gerrit/" target="_blank">
Gerrit Code Review
</a>
(V2.2.1-NQT-012) |
<a href="http://code.google.com/p/gerrit/issues/list" target="_blank">
Report Bug
</a>
</div>
</div>
</div>
<iframe id="__gwt_historyFrame" src="javascript:''" style="position:absolute;width:0;height:0;border:0" tabindex="-1">
</iframe>
<script language="javascript" type="text/javascript">
//<![CDATA[
<!--
function gerrit(){var s,l,t,w=window,d=document,n='gerrit',f=d.createElement('iframe');function m(){if(s&&l){var b,i=d.createElement('img');i.src=n+'/clear.cache.gif';b=i.src;b=b.substring(0,b.lastIndexOf('/')+1);gerrit=null;f.contentWindow.gwtOnLoad(undefined,n,b);}}gerrit.onScriptLoad=function(){s=1;m();};gerrit.r=function(){l=1;m();};f.src="javascript:''";f.id=n;f.style.cssText='position:absolute;width:0;height:0;border:none';f.tabIndex=-1;d.body.appendChild(f);f.contentWindow.location.replace(n+'/7209E38C5F54FA2918411884E5DCDFEC.cache.html');d.write('<script defer="defer">gerrit.r()</'+'script>');}gerrit();
//-->
//]]>
</script>
<iframe src="javascript:''" id="gerrit" style="position:absolute;width:0;height:0;border:none" tabindex="-1">
</iframe>
<script defer="defer">
//<![CDATA[
gerrit.r()
//]]>
</script>
</body>
</html>
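Note that the snapshot above still contains the "Loading ..." placeholder and the Change-Id, Owner and Project cells are empty, which suggests the page source was read before the page's background requests had finished. One way to deal with that is to poll the page source until the placeholder is gone. The following is only a minimal sketch of that idea: the class name `SnapshotPoller`, both method names and the choice of "Loading ..." as the marker are assumptions made for illustration, not part of Selenium or HtmlUnit. The page source is supplied through a plain `Supplier<String>` so the sketch runs without any dependencies.

```java
import java.util.function.Supplier;

public class SnapshotPoller {
    // Assumed marker: the placeholder text that appears in the output above.
    static final String LOADING_MARKER = "Loading ...";

    /** Returns true if the snapshot is non-null and no longer shows the placeholder. */
    public static boolean looksRendered(String pageSource) {
        return pageSource != null && !pageSource.contains(LOADING_MARKER);
    }

    /**
     * Reads the source repeatedly until looksRendered accepts it or the
     * timeout elapses. Returns the last snapshot read in either case.
     */
    public static String pollForSnapshot(Supplier<String> source,
                                         long timeoutMillis,
                                         long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        String snapshot = source.get();
        while (!looksRendered(snapshot) && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
            snapshot = source.get();
        }
        return snapshot;
    }

    public static void main(String[] args) {
        // Stand-in for a real driver: the first two reads return the
        // placeholder, the third returns the "rendered" page.
        int[] calls = {0};
        Supplier<String> fakeSource =
                () -> ++calls[0] < 3 ? "<span>Loading ...</span>" : "<td>Owner</td>";
        System.out.println(pollForSnapshot(fakeSource, 2000, 10));
    }
}
```

With the driver from the answer, the supplier would presumably be `driver::getPageSource`, for example `pollForSnapshot(driver::getPageSource, 10000, 500)`. Whether the disappearance of "Loading ..." is a reliable signal for this particular Gerrit page is untested. Selenium's own `WebDriverWait` class is probably the more idiomatic tool for waiting on a specific element.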