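For some reason, Scrapy is parsing data from URLs that match my deny rules: I am getting parsed data from URLs containing /browse/, /search/, and /ip/. I am not sure where this goes wrong.
Please advise, thanks! My code is below: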
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "tp"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com",]

    # /tp/ page type to crawl
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=('/tp/', ),
                deny=(
                    'browse/',
                    'browse-ng.do?',
                    'search-ng.do?',
                    'facet=',
                    'ip/',
                    # Comma added after 'page/': without it Python joins the two
                    # adjacent literals into the single pattern 'page/search/'.
                    'page/',
                    'search/',
                    '/[1-9]$',
                    '(bti=)[1-9]+(?:\.[1-9]*)?',
                    '(sort_by=)[a-zA-Z]',
                    '(sort_by=)[1-9]+(?:\.[1-9]*)?',
                    '(ic=32_)[1-9]+(?:\.[1-9]*)?',
                    '(ic=60_)[0-9]+(?:\.[0-9]*)?',
                    '(search_sort=)[1-9]+(?:\.[1-9]*)?',
                ),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []
        for site in sites:
            item = Website()
            # Referer is the /tp/ page the link was found on
            item['referer'] = response.request.headers.get('Referer')
            # response.url is the final URL, i.e. after any redirect
            item['url'] = response.url
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['description'] = site.select('//meta[@name="Description"]/@content').extract()
            items.append(item)
        return items
Part of my console log. Why is it grabbing /ip/ pages?
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/1104329> (referer: http://www.mydomain.com/tp/john-duigan)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/1104329>
{'description': [u'Shop Low Prices on: Molly (Widescreen) : Movies'],
'referer': 'http://www.mydomain.com/tp/john-duigan',
'title': [u'Molly (Widescreen): Movies : mydomain.com '],
'url': 'http://www.mydomain.com/ip/1104329'}
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/jon-furmanski>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/taylor-byrd>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/greg-byers>
2013-12-11 11:21:43-0800 [tp] DEBUG: Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
2013-12-11 11:21:43-0800 [tp] DEBUG: Crawled (200) <GET http://www.mydomain.com/ip/21152221> (referer: http://www.mydomain.com/tp/peter-levin)
2013-12-11 11:21:43-0800 [tp] DEBUG: Scraped from <200 http://www.mydomain.com/ip/21152221>
{'description': [u'Shop Low Prices on: Marva Collins Story (1981) : Video on Demand by VUDU'],
'referer': 'http://www.mydomain.com/tp/peter-levin',
'title': [u'Marva Collins Story (1981): Video on Demand by VUDU : mydomain.com '],
'url': 'http://www.mydomain.com/ip/21152221'}
Answer 0 (score: 4)
The rules of SgmlLinkExtractor apply when links are extracted from a web page. In your case, some of your .../tp/... requests are being redirected to .../ip/... pages:
Redirecting (302) to <GET http://www.mydomain.com/ip/17371019> from <GET http://www.mydomain.com/tp/tom-bowker>
The allow and deny patterns are not applied to URLs reached through a redirect.
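To make this concrete, here is a minimal sketch in plain Python (no Scrapy required). It assumes that each deny entry is treated as a regular expression searched against the candidate link URL; the helper name `is_denied` and the shortened pattern list are hypothetical and used only for illustration.

```python
import re

# Shortened, illustrative subset of the deny patterns from the question.
DENY_PATTERNS = [re.compile(p) for p in ('browse/', 'ip/', 'search/')]


def is_denied(url):
    """Return True if any deny pattern is found anywhere in the URL."""
    return any(p.search(url) for p in DENY_PATTERNS)


# The link found on the page is a /tp/ URL, so it passes the deny check
# and a request for it is scheduled.
link_on_page = 'http://www.mydomain.com/tp/tom-bowker'

# The server answers that request with a 302 pointing at an /ip/ URL.
# The deny check is not run again against this redirect target.
redirect_target = 'http://www.mydomain.com/ip/17371019'

print(is_denied(link_on_page))     # False
print(is_denied(redirect_target))  # True
```

The redirect target would have been rejected had it appeared as a link on a page, but it never passes through the link extractor, so it gets crawled and scraped.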
You can disable following redirects entirely by setting REDIRECT_ENABLED to False (see RedirectMiddleware).
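A rough sketch of both ways to handle this follows. The first is the REDIRECT_ENABLED setting itself, shown as a comment because it belongs in the project's settings.py. The second is only an assumption on my part, not something the Scrapy docs prescribe: keep redirects enabled and drop the page in the callback when the final URL is an unwanted page type. `DENIED_PATH`, `landed_on_denied_page` and `FakeResponse` are hypothetical names, and `FakeResponse` is a stand-in so the snippet runs without Scrapy.

```python
import re

# Option 1: in the project's settings.py
#     REDIRECT_ENABLED = False

# Option 2 (an assumption, see above): keep redirects enabled and
# filter in the callback based on the URL the request ended up at.
DENIED_PATH = re.compile(r'/(?:browse|search|ip)/')


def landed_on_denied_page(final_url):
    """True if the final URL is one of the unwanted page types."""
    return DENIED_PATH.search(final_url) is not None


class FakeResponse(object):
    """Stand-in for a Scrapy response; only the .url attribute is used."""

    def __init__(self, url):
        self.url = url


def parse_items(response):
    # response.url is the URL after redirects, not the original /tp/ link.
    if landed_on_denied_page(response.url):
        return []  # drop the page instead of scraping it
    return [{'url': response.url}]


print(parse_items(FakeResponse('http://www.mydomain.com/ip/17371019')))
print(parse_items(FakeResponse('http://www.mydomain.com/tp/john-duigan')))
```

Option 1 avoids downloading the /ip/ pages at all. Option 2 still downloads them but keeps them out of the scraped items, which may be preferable if some redirects on the site lead to pages you do want.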
Answer 1 (score: 0)
I found the error: the pages were redirecting to page types that are in my deny rules. Thanks for your help, I appreciate it!