I'm trying to scrape a web page, but it requires me to log in first. I'm new to web scraping, so please bear with my code:
$conn = sasql_connect("HOST=host:port;DBN=dbn;UID=uid;PWD=pwd");
if (!$conn) {
    echo "Connection failed.";
    die();
}
$highest_id = -1;
$num_rows_retrieved = 0;
do {
    if (!sasql_real_query($conn, "SELECT TOP 32767 * FROM dba.anytable where anytable_id > $highest_id order by anytable_id")) {
        echo "Query failed.";
        die();
    }
    $result = sasql_use_result($conn);
    if (!$result) {
        echo "No result set.";
        die();
    }
    $num_rows_retrieved = 0;
    $num_fields = sasql_num_fields($result);
    while ($row = sasql_fetch_row($result)) {
        $highest_id = $row[0]; // assumes anytable_id is the first field
        $i = 0;
        while ($i < $num_fields) {
            echo "$row[$i]\t";
            $i++;
        }
        $num_rows_retrieved++;
        echo "\n";
    }
    sasql_free_result($result);
} while ($num_rows_retrieved == 32767);
sasql_disconnect($conn);
But I get this error:
from bs4 import BeautifulSoup
import mechanize

browser = mechanize.Browser()
browser.addheaders = [('User-agent', 'Mozilla/5.0')]
browser.set_handle_robots(False)  # do not fetch or obey robots.txt
browser.open('https://mywebsite.com')
# Login attempt, currently commented out:
# browser.select_form(name = 'form2')
# browser.form['Account Name'] = 'username'
# browser.form['Password'] = 'mypassword'
# browser.submit()
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(browser.response().read(), 'html.parser')
print(soup)
Answer 0 (score: 0)
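Try sending the following User-Agent header. The server may not recognize the one you are sending, which could lead it to conclude that JavaScript is not enabled: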
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
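As a rough sketch of how that header could be applied, the snippet below sets it on a standard-library opener. It uses `urllib.request` (Python 3) rather than mechanize so that it needs nothing installed; the helper name `build_opener_with_user_agent` is made up for illustration. Whether the site then serves the real page is an assumption that only a test against the site can confirm.

```python
import urllib.request

# User-Agent string quoted in the answer above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
)


def build_opener_with_user_agent(user_agent=USER_AGENT):
    """Return an opener that sends `user_agent` with every request.

    Hypothetical helper for illustration; not part of any library.
    """
    opener = urllib.request.build_opener()
    # Replace the default "Python-urllib/x.y" header instead of appending,
    # so that only one User-Agent header is sent.
    opener.addheaders = [("User-agent", user_agent)]
    return opener


def fetch(url, opener=None):
    """Download `url` and return the raw response body as bytes."""
    opener = opener or build_opener_with_user_agent()
    with opener.open(url) as response:
        return response.read()
```

With mechanize, the same list of `(name, value)` tuples can be assigned to `browser.addheaders`, exactly where the question's script currently sets `'Mozilla/5.0'`. Calling `fetch('https://mywebsite.com')` would perform the request against the placeholder URL from the question.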
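Note: some websites have anti-scraping protection that requires you to solve a JavaScript challenge before they return the actual content. You can use Js2Py, or any other JavaScript runtime, for that task. Scraping such sites is much harder, but fortunately few websites use this kind of system.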