Here is the code for my spider:
from scrapy.spiders import Spider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest, Request


class BannerSpider(Spider):
    name = "Banner"
    allowed_domains = ["aus.edu"]
    start_urls = ["https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_disp_dyn_ctlg"]

    def parse(self, response):
        yield FormRequest.from_response(
            response,
            formxpath='/html/body/div/form',
            formdata={'cat_term_in': '201610'},
            callback=self.getCoursePages
        )

    def getCoursePages(self, response):
        hxs = HtmlXPathSelector(response)
        for category in hxs.select("//select[@id='subj_id']/option//@value").extract():
            yield FormRequest.from_response(
                response,
                formxpath='/html/body/div/form',
                formdata={'sel_subj': category, 'sel_levl': '%', 'sel_attr': '%', 'term_in': '201610'},
                callback=self.getCourses
            )

    def getCourses(self, response):
        hxs = HtmlXPathSelector(response)
        for course in hxs.select("//td[@class='nttitle']/a//@value").extract():
            print course
Here is a small part of the output. It keeps printing the same thing over and over:
2015-07-07 02:27:50 [scrapy] DEBUG: Retrying <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-07-07 02:27:50 [scrapy] DEBUG: Crawled (200) <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (referer: https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_disp_cat_term_date)
2015-07-07 02:27:50 [scrapy] DEBUG: Retrying <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-07-07 02:27:50 [scrapy] DEBUG: Retrying <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-07-07 02:27:51 [scrapy] DEBUG: Crawled (200) <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (referer: https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_disp_cat_term_date)
2015-07-07 02:27:51 [scrapy] DEBUG: Retrying <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-07-07 02:27:51 [scrapy] DEBUG: Crawled (200) <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (referer: https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_disp_cat_term_date)
2015-07-07 02:27:51 [scrapy] DEBUG: Crawled (200) <POST https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_display_courses> (referer: https://banner.aus.edu/axp3b21h/owa/bwckctlg.p_disp_cat_term_date)
I am new to Scrapy, so I can't figure out why this is happening. I have already tried DOWNLOAD_DELAY; it didn't help.
Answer 0 (score: 0)
I suggest you try setting CONCURRENT_REQUESTS_PER_DOMAIN to 1 (the default is 8), in combination with DOWNLOAD_DELAY.
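In settings.py, that combination might look like this (the delay value here is only an illustrative starting point to tune):

```python
# settings.py
# Limit to one concurrent request per domain (Scrapy's default is 8)
CONCURRENT_REQUESTS_PER_DOMAIN = 1
# Wait between consecutive requests to the same site, in seconds
DOWNLOAD_DELAY = 2
```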
This can also happen when the site you are scraping detects that the client visiting it is not a browser. To work around this, you can add the following line to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'
You can read more about scrapers getting blocked here: http://webscraping.com/blog/How-to-crawl-websites-without-being-blocked/
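As an alternative to a fixed DOWNLOAD_DELAY, Scrapy also ships an AutoThrottle extension that adjusts the delay dynamically based on how fast the server responds; a minimal sketch (the delay bounds are illustrative, not tuned for this site):

```python
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10    # upper bound when the server is responding slowly
```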