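I am trying to scrape a website.
The site returns "Error, query failed" in the response body. I then click the "Find USED" tab and click Search to get the results. The Search button actually issues a POST request and fetches the data.
Here is my spider: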
from scrapy.http import Request
from scrapy.selector import Selector

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    # Attach the session cookies copied from the browser to every request.
    cookies = {'PHPSESSID': '0a94ce3bf2484d5102a047b86f5b6c17',
               '__utm': '154876456.1461047540.1397668365.1397668365.1397668365.1'}
    return Request(url, cookies=cookies, callback=self.parse)

def parse(self, response):
    sel = Selector(response)
    print(sel)
I get this response:
2014-04-16 21:04:27+0300 [XXX] DEBUG: Crawled (403) <POST http://website> (referer: None)
What am I doing wrong?
I inspected the request that is sent when I click the Search button. This is the request:
http://www.autodealer.ae/plugins/ad/buy.php?q=used+cars+dubai
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: PHPSESSID=0a94ce3bf2484d5102a047b86f5b6c17;
Connection: keep-alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 131
What am I doing wrong, or how should I scrape this site?
Is something wrong with my spider?

Answer 0 (score: 1)
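The site requires some kind of authentication, and you are supplying a bogus PHPSESSID cookie in your Request object. Your Python code should authenticate first and only then go on to send requests to the site.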
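A minimal sketch of "authenticate first, then request", written against the Python 3 standard library rather than Scrapy so it runs without extra packages. `LOGIN_URL`, `SEARCH_URL`, the form field names, and the helper names are all assumptions made for illustration; the question does not show the site's login form or the body of the search POST.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical endpoints: substitute the real login and search URLs.
LOGIN_URL = "http://example.com/login.php"
SEARCH_URL = "http://example.com/plugins/ad/buy.php"


def make_session():
    """Return an opener plus the cookie jar it fills from Set-Cookie headers."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def form_post(url, fields):
    """Build a form-encoded POST request without sending it."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def login_then_search(opener, credentials, query):
    """Log in, then search with whatever session cookie the server issued."""
    # Step 1: authenticate; the server's session cookie lands in the jar.
    opener.open(form_post(LOGIN_URL, credentials)).close()
    # Step 2: the same opener sends that cookie back automatically.
    with opener.open(form_post(SEARCH_URL, query)) as resp:
        return resp.status, resp.read()
```

The point of the design is that the session ID comes from the server instead of being pasted in from a browser. In Scrapy the same idea is usually expressed by sending a `FormRequest` for the login step and issuing the search request from its callback, since Scrapy's cookie middleware carries the session between requests by default.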
--------Edited----------
POSTing to that URL results in a 403 error.
$ curl -X POST "SITE EDITED"
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /plugins/ad/buy.php
on this server.</p>
<p>Additionally, a 404 Not Found
error was encountered while trying to use an ErrorDocument to handle the request.
</p>
</body></html>
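The same check can be reproduced from Python. This is only a sketch using the Python 3 standard library; `probe` and `page_title` are illustrative names, and the URL to pass in is whatever was redacted above.

```python
import re
import urllib.error
import urllib.request


def probe(url, data=b""):
    """POST to url and return (status, body), even when the server refuses."""
    req = urllib.request.Request(url, data=data, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise, but the error object still carries the body.
        return err.code, err.read()


def page_title(body):
    """Pull the <title> text out of an HTML error page, or return None."""
    match = re.search(rb"<title>(.*?)</title>", body, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    return match.group(1).decode("utf-8", "replace").strip()
```

Calling `probe` on the search URL and passing the returned body to `page_title` should yield the same "403 Forbidden" title that curl shows, which helps confirm that the refusal comes from the server and not from the spider's parsing code.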