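I am trying to crawl a website with LinkExtractor and output all the links found on a given page.
Scrapy does not output any links for some websites. For example, it seems to work if I try https://blog.nus.edu.sg, but not for http://nus.edu.sg.
Both URLs lead to a working website. I looked at the source of both sites, and they appear to link to other pages in a similar way.
Here is my spider: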
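import logging

import scrapy
from scrapy.linkextractors import LinkExtractor


class Crawler(scrapy.Spider):
    name = 'all'
    # Class-level settings. Scrapy's depth setting is named DEPTH_LIMIT;
    # DEPTH_LEVEL is not a recognised setting and is silently ignored.
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'DEPTH_LIMIT': 1,
    }

    def __init__(self, startURL, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links = []
        self.start_urls = [startURL]

    def parse(self, response):
        le = LinkExtractor()
        print(le)
        # Print every link found on the fetched page.
        for link in le.extract_links(response):
            print(link.url)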
It is called from the following function:
from scrapy.crawler import CrawlerProcess


def _getLinksDriver(url):
    # agent is a user-agent string defined earlier
    header = {'USER_AGENT': agent}
    process = CrawlerProcess(header)
    process.crawl(Crawler, url)
    process.start(stop_after_crawl=True)
For example, if I try
_getLinksDriver("http://nus.edu.sg")
the output is simply
2019-06-11 11:42:22 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-06-11 11:42:22 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-21-generic-x86_64-with-Ubuntu-18.04-bionic
2019-06-11 11:42:22 [scrapy.crawler] INFO: Overridden settings: {'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor object at 0x7fc45fbbac18>
However, if we navigate to the actual site, there are clearly links on the page.
Trying _getLinksDriver("https://blog.nus.edu.sg")
works:
2019-06-11 11:38:20 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-06-11 11:38:20 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-21-generic-x86_64-with-Ubuntu-18.04-bionic
2019-06-11 11:38:20 [scrapy.crawler] INFO: Overridden settings: {'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
<scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor object at 0x7fc4605bcb38>
https://blog.nus.edu.sg#main
https://blog.nus.edu.sg/
http://blog.nus.edu.sg/
https://wiki.nus.edu.sg/display/cit/Blog.nus+Common+Queries
http://help.edublogs.org/user-guide/
https://wiki.nus.edu.sg/display/cit/Blog.nus+Terms+of+Use
https://wiki.nus.edu.sg/display/cit/Blog.nus+Disclaimers
https://blog.nus.edu.sg/wp-signup.php
http://twitter.com/nuscit
http://facebook.com/nuscit
https://blog.nus.edu.sg#scroll-top
http://cyberchimps.com/responsive-theme/
http://wordpress.org/
http://cit.nus.edu.sg/
http://www.nus.edu.sg/
http://www.statcounter.com/wordpress.org/
https://blog.nus.edu.sg#wp-toolbar
https://blog.nus.edu.sg/wp-login.php?redirect_to=https%3A%2F%2Fblog.nus.edu.sg%2F
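This is what I expect to see.
How can I make this work for all websites?
Thanks.
In case it helps, here are my Scrapy and Python versions and all their dependencies: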
2019-06-11 11:42:12 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: scrapybot)
2019-06-11 11:42:12 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.0, Python 3.6.7 (default, Oct 22 2018, 11:32:17) - [GCC 8.2.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b 26 Feb 2019), cryptography 2.6.1, Platform Linux-4.18.0-21-generic-x86_64-with-Ubuntu-18.04-bionic
Answer 0 (score: 1)
The simple reason your code does not work for the site above (http://nus.edu.sg/) is Incapsula.
If you inspect response.body, you will find the following:
Request unsuccessful. Incapsula incident ID: 432001820008199878-98367043303115621
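The check above can also be automated. The following is only a minimal sketch and not part of the original answer: the helper name `looks_blocked` and the marker list are assumptions, based on the block-page text and the script path that appear in this thread. Other Incapsula responses may well look different.

```python
# Hypothetical helper: the name and the marker list are illustrative assumptions.
# The markers are the strings seen in the blocked responses quoted in this thread.
BLOCK_MARKERS = (b"Incapsula incident ID", b"_Incapsula_Resource")


def looks_blocked(body: bytes) -> bool:
    """Return True if the raw response body contains a known block marker."""
    return any(marker in body for marker in BLOCK_MARKERS)


# Inside a spider callback this could be used as:
#     def parse(self, response):
#         if looks_blocked(response.body):
#             self.logger.warning("Blocked by Incapsula: %s", response.url)
#             return
```

Logging the blocked URL at warning level keeps it visible even with LOG_LEVEL set to WARNING, so an empty link list is no longer silent.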
Answer 1 (score: 0)
Just an addendum to gangabass's answer (so please accept his):
As gangabass mentioned, http://nus.edu.sg is protected against bots by Incapsula.
What Scrapy sees is this (curl 'http://nus.edu.sg/'):
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>
The actual content is loaded via JavaScript, which Scrapy does not execute. If you want the JavaScript to be executed, you can use scrapy-splash: https://github.com/scrapy-plugins/scrapy-splash
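For reference, here is a minimal sketch of the project settings that the scrapy-splash README describes. The `SPLASH_URL` value assumes a Splash instance running locally on port 8050, which is an assumption and not something stated in this thread, and whether rendering through Splash is enough to get past Incapsula has not been verified here.

```python
# Settings as described in the scrapy-splash README; check them against the
# version you install. SPLASH_URL is an assumed local Splash instance.
SPLASH_SETTINGS = {
    "SPLASH_URL": "http://localhost:8050",
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    },
    "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
}
```

These entries could be merged into the dict passed to CrawlerProcess. The spider would then yield `scrapy_splash.SplashRequest(url, self.parse, args={'wait': 0.5})` from `start_requests` instead of relying on `start_urls`; the `wait` value is an arbitrary example.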
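Unfortunately, this is more complicated (but that is exactly what the site owner wants). If you want to be polite, just don't crawl these pages (https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy).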
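For sites that do allow crawling, polite behaviour can be expressed through Scrapy settings. The sketch below uses real Scrapy setting names, but the values and the user-agent string are illustrative assumptions, not recommendations from the linked article.

```python
# Illustrative values only; the delay and the contact details are assumptions.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,        # respect the site's robots.txt
    "DOWNLOAD_DELAY": 2.0,         # seconds to wait between requests to a site
    "AUTOTHROTTLE_ENABLED": True,  # adapt the delay to server response times
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Identify the crawler honestly; the name and URL are placeholders.
    "USER_AGENT": "example-crawler (+https://example.com/contact)",
}
```

A dict like this can be passed to CrawlerProcess in place of the one that only sets USER_AGENT.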