Can someone help me make a request using Scrapy's Request class? I tried this, but it doesn't work:
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
url = 'http://www.fetise.com'
a = Request(url)
hxs = HtmlXPathSelector(a)
The error is:
Traceback (most recent call last):
  File "sa.py", line 83, in <module>
    hxs = HtmlXPathSelector(a)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmlsel.py", line 31, in __init__
    _root = LxmlDocument(response, self._parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 27, in __new__
    cache[parser] = _factory(response, parser)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 13, in _factory
    body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Request' object has no attribute 'body_as_unicode'
I know about callbacks. Actually, I first want to scrape the URLs from the site and then use them as start URLs.
Answer 0 (score: 1)
Please try this:
import urllib
from scrapy.selector import HtmlXPathSelector
from pprint import pprint

url = 'http://www.fetise.com'
# Fetch the page body ourselves (Python 2's urllib), then feed the
# text to the selector -- a selector needs a response body, not a Request
data = urllib.urlopen(url).read()
hxs = HtmlXPathSelector(text=data)
lista = hxs.select('//ul[@class="categoryMenu"]/li/ul/li/a/@href').extract()
# Prefix relative hrefs with the site root; keep absolute ones as-is
acb = ["http://www.fetise.com/" + i if "http://www.fetise.com/" not in i else i
       for i in lista] + [u"http://www.fetise.com/sale"]
pprint(acb)
Here is the output:
[u'http://www.fetise.com/apparel/shirts',
u'http://www.fetise.com/apparel/tees',
u'http://www.fetise.com/apparel/tops-and-tees',
u'http://www.fetise.com/accessories/belts',
u'http://www.fetise.com/accessories/cufflinks',
u'http://www.fetise.com/accessories/jewellery',
u'http://www.fetise.com/accessories/lighters',
u'http://www.fetise.com/accessories/others',
u'http://www.fetise.com/accessories/sunglasses',
u'http://www.fetise.com/accessories/ties-cufflinks',
u'http://www.fetise.com/accessories/wallets',
u'http://www.fetise.com/accessories/watches',
u'http://www.fetise.com/footwear/boots',
u'http://www.fetise.com/footwear/casual',
u'http://www.fetise.com/footwear/flats',
u'http://www.fetise.com/footwear/heels',
u'http://www.fetise.com/footwear/loafers',
u'http://www.fetise.com/footwear/sandals',
u'http://www.fetise.com/footwear/shoes',
u'http://www.fetise.com/footwear/slippers',
u'http://www.fetise.com/footwear/sports',
u'http://www.fetise.com/innerwear/boxers',
u'http://www.fetise.com/innerwear/briefs',
u'http://www.fetise.com/personal-care/deos',
u'http://www.fetise.com/personal-care/haircare',
u'http://www.fetise.com/personal-care/perfumes',
u'http://www.fetise.com/personal-care/personal-care',
u'http://www.fetise.com/personal-care/shavers',
u'http://www.fetise.com/apparel/tees/gifts-for-her',
u'http://www.fetise.com/footwear/sandals/gifts-for-her',
u'http://www.fetise.com/footwear/shoes/gifts-for-her',
u'http://www.fetise.com/footwear/heels/gifts-for-her',
u'http://www.fetise.com/footwear/flats/gifts-for-her',
u'http://www.fetise.com/footwear/ballerinas/gifts-for-her',
u'http://www.fetise.com/footwear/loafers/gifts-for-her',
u'http://www.fetise.com/sale']
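The long list comprehension above packs the absolute-vs-relative URL handling into one line. As an illustrative sketch (the `normalize` helper name is mine, not from the answer), the same logic reads:

```python
BASE = "http://www.fetise.com/"

def normalize(href):
    """Return an absolute URL, mirroring the answer's comprehension:
    hrefs that already contain the site root are kept, all others
    are prefixed with it."""
    return href if BASE in href else BASE + href

urls = [normalize(h) for h in ["apparel/shirts", "http://www.fetise.com/sale"]]
print(urls)
```

The extracted URLs can then be used as the spider's start URLs, which is what the question asked for.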
Answer 1 (score: 0)
The documentation suggests that you need to pass in a callback when making the request. The callback will have access to the response object:
From the docs:
Passing additional data to callback functions: the callback of a request is a function that will be called when the response of that request is downloaded. The callback function will be called with the downloaded Response object as its first argument.
Example:
def parse_page1(self, response):
return Request("http://www.example.com/some_page.html",
callback=self.parse_page2)
def parse_page2(self, response):
# this would log http://www.example.com/some_page.html
self.log("Visited %s" % response.url)
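To see why the asker's original snippet fails, it helps to notice that the callback always receives a Response, never the Request itself. Here is a toy stand-in for Scrapy's request/callback cycle (plain Python; `ToyRequest`, `ToyResponse`, and `download` are invented for illustration and are not Scrapy APIs):

```python
class ToyRequest(object):
    # Stand-in for scrapy.Request: holds a URL and a callback
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class ToyResponse(object):
    # Stand-in for a Scrapy Response: what the callback actually receives
    def __init__(self, url, body):
        self.url = url
        self.body = body

def download(request):
    # The engine "downloads" the page, then invokes the callback
    # with the Response -- never with the Request itself
    response = ToyResponse(request.url, "<html>fake page</html>")
    return request.callback(response)

def parse_page(response):
    # Selectors belong here: a Response has a body, a Request does not,
    # which is exactly why HtmlXPathSelector(a) raised AttributeError
    return "Visited %s" % response.url

result = download(ToyRequest("http://www.example.com/some_page.html", parse_page))
print(result)
```

In a real spider, yielding `Request(url, callback=self.parse_page)` lets Scrapy's engine do the downloading, and `parse_page` builds its selectors from the response it is handed.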