I want to use Scrapy to scrape parts of some very large websites. For example, from northeastern.edu I only want to crawl the pages below the URL http://www.northeastern.edu/financialaid/,
such as http://www.northeastern.edu/financialaid/contacts or http://www.northeastern.edu/financialaid/faq. I do not want to scrape the university's entire web site, i.e. http://www.northeastern.edu/faq should not be allowed.
URLs of the format financialaid.northeastern.edu would not be a problem for me (I could simply restrict allowed_domains to financialaid.northeastern.edu), but the same strategy does not work for northestern.edu/financialaid. (The full spider code is actually longer, since it loops over different web pages; I can provide details. Everything works apart from the rules.)
Here is the spider:
import scrapy
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['northestern.edu/financialaid']
    start_urls = ['http://www.northestern.edu/financialaid/']

    rules = (
        Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
The result is as follows:
2015-05-12 14:10:46-0700 [scrapy] INFO: Scrapy 0.24.4 started (bot: finaid_scraper)
2015-05-12 14:10:46-0700 [scrapy] INFO: Optional features available: ssl, http11
2015-05-12 14:10:46-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'finaid_scraper.spiders', 'SPIDER_MODULES': ['finaid_scraper.spiders'], 'FEED_URI': '/Users/hugo/Box Sync/finaid/ScrapedSiteText_check/Northeastern.json', 'USER_AGENT': 'stanford_sociology', 'BOT_NAME': 'finaid_scraper'}
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-05-12 14:10:46-0700 [scrapy] INFO: Enabled item pipelines:
2015-05-12 14:10:46-0700 [graphspider] INFO: Spider opened
2015-05-12 14:10:46-0700 [graphspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-12 14:10:46-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-12 14:10:46-0700 [graphspider] DEBUG: Redirecting (301) to <GET http://www.northeastern.edu/financialaid/> from <GET http://www.northeastern.edu/financialaid>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/financialaid/> (referer: None)
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'assistive.usablenet.com': <GET http://assistive.usablenet.com/tt/http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.northeastern.edu': <GET http://www.northeastern.edu/financialaid/index.html>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'www.facebook.com': <GET http://www.facebook.com/pages/Boston-MA/NU-Student-Financial-Services/113143082891>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'twitter.com': <GET https://twitter.com/NUSFS>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'nusfs.wordpress.com': <GET http://nusfs.wordpress.com/>
2015-05-12 14:10:47-0700 [graphspider] DEBUG: Filtered offsite request to 'northeastern.edu': <GET http://northeastern.edu/howto>
2015-05-12 14:10:47-0700 [graphspider] INFO: Closing spider (finished)
2015-05-12 14:10:47-0700 [graphspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 431,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 9574,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 12, 21, 10, 47, 94112),
'log_count/DEBUG': 10,
'log_count/INFO': 7,
'offsite/domains': 6,
'offsite/filtered': 32,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 5, 12, 21, 10, 46, 566538)}
2015-05-12 14:10:47-0700 [graphspider] INFO: Spider closed (finished)
The second strategy I attempted was to use the allow rules of the LxmlLinkExtractor and restrict the crawl to everything within the sub-domain, but in that case the whole web site gets scraped. (The deny rules do work.)
import scrapy
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from test.items import testItem

class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['www.northestern.edu']
    start_urls = ['http://www.northestern.edu/financialaid/']

    rules = (
        Rule(LxmlLinkExtractor(allow=(r"financialaid/",)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = testItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
I have also tried:
rules = (
    Rule(LxmlLinkExtractor(allow=(r"northeastern.edu/financialaid",)), callback='parse_site', follow=True),
)
The log is too long to post here in full, but these lines show that Scrapy ignores the allow rule:
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Crawled (200) <GET http://www.northeastern.edu/camd/journalism/tag/north-carolina/> (referer: http://www.northeastern.edu/camd/journalism/2014/10/07/prof-schroeder-quoted-nc-u-s-senate-debates-charlotte-observer/)
2015-05-12 14:26:06-0700 [graphspider] DEBUG: Scraped from <200 http://www.northeastern.edu/camd/journalism/2014/10/07/prof-leff-talks-american-press-holocaust/>
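My items.py just declares the three fields referenced in parse_item; a minimal sketch:

from scrapy.item import Item, Field

class testItem(Item):
    # fields accessed (commented out) in parse_item above
    domain_id = Field()
    name = Field()
    description = Field()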
I am using a Mac, Python 2.7, and Scrapy version 0.24.4. Similar questions have been posted before, but none of the suggested solutions fixed my problem.
Answer 0 (score: 2)
You have a typo in the URLs you are using inside the spider, see:
northeastern
vs.
northestern
Here is the spider that worked for me (it only follows "financialaid" links):
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class DomainSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['northeastern.edu']
    start_urls = ['http://www.northeastern.edu/financialaid/']

    rules = (
        Rule(LinkExtractor(allow=r"financialaid/"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print response.url
Note that I am using the LinkExtractor shortcut and a plain string for the allow argument value.
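Also note that allowed_domains should contain bare domain names only; an entry with a path in it (like 'northestern.edu/financialaid') can never match a request's hostname, which is why your first log shows every request being filtered as offsite. The path restriction belongs in the link extractor instead. If you want the rule to be stricter than a substring match, an anchored pattern along the lines of this sketch (untested; the class and spider names are just placeholders) should also keep the crawl inside /financialaid/:

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class FinancialAidSpider(CrawlSpider):
    name = 'finaid'
    allowed_domains = ['northeastern.edu']  # domains only, no path component
    start_urls = ['http://www.northeastern.edu/financialaid/']

    rules = (
        # the anchored regex is matched against the absolute link URL, so only
        # pages under /financialaid/ on www.northeastern.edu are followed
        Rule(
            LinkExtractor(allow=r"^https?://www\.northeastern\.edu/financialaid/"),
            callback='parse_item',
            follow=True,
        ),
    )

    def parse_item(self, response):
        print response.url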
I have also edited your question and fixed the indentation, assuming it was just a posting issue.