I am trying to fetch and view the HTML source of a website, but I cannot get it. Below are my code snippet and the output I get:
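import mechanize
from BeautifulSoup import BeautifulSoup

browser = mechanize.Browser()  # was missing: 'browser' was never defined
page = browser.open('https://www.My-Website.com')
source_code = page.read()
print source_code
try:
    soup = BeautifulSoup(source_code)
    images = soup.findAll('img')
except Exception as e:
    # the original snippet was cut off here; this handler is a placeholder
    print e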
Output:
<html><head><meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Expires" content="-1"/>
<meta http-equiv="CacheControl" content="no-cache"/>
<script type="text/javascript">
function decode_string(in_str) { return decodeURIComponent(in_str); }
function test(){var table = "00000000 77073096 EE0E612C 990951BA 076DC419 706AF48F
...
var c = 1354021302
var slt = "iM78ylW5"
...
for (var i=0; i<n; i++)
arr[i] = s1;
for (var i=0; i<m-1; i++){
for(var j=n-1; j>=0;--j) {
var t = arr[j].charCodeAt(0);
t++; arr[j] = String.fromCharCode(t);
if (arr[j].charCodeAt(0)<=end) {
break;} else { arr[j] = s1 ;}}
var chlg = arr.join(""); var str = chlg + slt;
var crc = 0;
var crc = crc ^ (-1);
for( var k = 0, iTop = str.length; k < iTop; k++ ){ crc = (crc >> 8) ^ ("0x" table.substr(((crc ^ str.charCodeAt(k) ) & 0x000000FF) * 9, 8));}
document.cookie = "TSd58639_75=" + "a26b965e6341773f64088b486e1859bf:" + chlg + ":" + slt + ":" + crc + ";Max-Age=3600;path=/";
document.forms[0].elements[2].value=decode_string(document.forms[0].elements[2].value)
document.forms[0].elements[4].value=decode_string(document.forms[0].elements[4].value)
if (document.forms[0].attributes['action']!=undefined) { document.forms[0].attributes['action'].value = decode_string(document.forms[0].attributes['action'].value) ;} else {document.forms[0].action = decode_string(document.forms[0].action);}
document.forms[0].submit();}
</script>
</head>
<body onload="test()">
<form method="POST" action="%2f">
</form>
<input type="hidden" name="TSd58639_id" value="3" />
<input type="hidden" name="TSd58639_md" value="1" />
<input type="hidden" name="TSd58639_rf" value="0" />
<input type="hidden" name="TSd58639_ct" value="0" />
<input type="hidden" name="TSd58639_pd" value="0" />
</body>
</html>
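The response above looks like a script-driven interstitial rather than the site's real page: the body runs test() on load, the script sets a cookie, and it then submits the form. A minimal sketch of how one might detect that situation before parsing is shown here. It is written for Python 3 with only the standard library (the snippet in the question is Python 2), and the names ChallengeDetector and looks_like_js_challenge are hypothetical helpers, not part of mechanize or BeautifulSoup. The three signals it checks are assumptions taken from this particular output and may not match other sites.

```python
from html.parser import HTMLParser


class ChallengeDetector(HTMLParser):
    """Collects hints that mark a script-driven interstitial page."""

    def __init__(self):
        super().__init__()
        self.body_onload = None   # value of <body onload="...">, if present
        self.in_script = False    # True while inside a <script> element
        self.script_text = []     # raw text of every <script> element

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.body_onload = dict(attrs).get("onload")
        elif tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_text.append(data)


def looks_like_js_challenge(html):
    """Return True if the page runs a script on load that sets a cookie
    and auto-submits a form (heuristic based on the output above)."""
    parser = ChallengeDetector()
    parser.feed(html)
    script = "".join(parser.script_text)
    sets_cookie = "document.cookie" in script
    auto_submits = ".submit()" in script
    return bool(parser.body_onload) and sets_cookie and auto_submits
```

If looks_like_js_challenge(source_code) returns True, the HTML in hand is the challenge page, so searching it for img tags will find nothing useful.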
UPDATE: I ran the page source through tidylib, as shown below:
from tidylib import tidy_document

document, errors = tidy_document(source_code)
print document
print errors
The errors output is:
line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 35 column 159 - Warning: <input> isn't allowed in <body> elements
line 35 column 45 - Info: <body> previously mentioned
line 35 column 159 - Warning: inserting implicit <form>
line 1 column 7 - Warning: inserting missing 'title' element
line 35 column 159 - Warning: <form> lacks "action" attribute
Notes
- When I view the page source in Firefox or another browser, I notice this declaration:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
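- I tried the latest beautifulsoup4 (4.3.1) but got the same result.
- I also get the same result with the requests library.
- This looks similar to the post "Python unable to retrieve form with urllib or mechanize", but the solution there did not work for me.
Please help.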
Answer 0 (score: 0)
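OK, it turned out to be a JavaScript issue: the page is produced by a script, and mechanize does not execute JavaScript. The same task worked using Python Selenium. Thanks.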
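The answer does not include its code, so the following is only a rough sketch of what the Selenium approach might look like, not the author's actual script. It assumes Python 3, the third-party selenium package, and a local Firefox with geckodriver. The helper names fetch_rendered_html, image_sources and main are hypothetical. It also assumes that the challenge page can be recognised by the `document.forms[0].submit()` call seen in the question's output, and it uses the standard library's html.parser in place of BeautifulSoup to keep the example self-contained.

```python
import time
from html.parser import HTMLParser

# Assumption: this call, copied from the question's output, marks the interstitial.
CHALLENGE_MARKER = "document.forms[0].submit()"


class ImgCollector(HTMLParser):
    """Gathers the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def fetch_rendered_html(driver, url, timeout=10.0, poll=0.5):
    """Load url in a real browser and poll until the challenge marker
    is gone or the timeout expires, then return the current HTML."""
    driver.get(url)
    deadline = time.time() + timeout
    html = driver.page_source
    while CHALLENGE_MARKER in html and time.time() < deadline:
        time.sleep(poll)
        html = driver.page_source
    return html


def image_sources(html):
    """Return the src values of all <img> tags in html."""
    collector = ImgCollector()
    collector.feed(html)
    return collector.sources


def main():
    # Imported here so the helpers above work without selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # requires Firefox and geckodriver
    try:
        html = fetch_rendered_html(driver, "https://www.My-Website.com")
        print(image_sources(html))
    finally:
        driver.quit()  # always close the browser
```

Calling main() opens a browser window, lets the page's script run, and prints the image URLs found in the resulting HTML. Passing the driver in as a parameter keeps the waiting logic independent of any one browser.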