I created a web crawler in Python using Scrapy, but it does not seem to work the way I expect, and I would appreciate any help.

The spider crawls a site with cars (http://allegro.pl/osobowe-volkswagen-4055?buyUsed=1). My problem is with recursive crawling (following links): I want the spider to follow subcategories until it reaches a category that has no further subcategories, e.g. Volkswagen -> Golf -> I (1974-1983).

For each node I want the name, the number of cars, the link to it if it has subnodes, and, as 'subcategory', an object with the same information for each subnode.
items.py:

import scrapy

class Allegro3Item(scrapy.Item):
    name = scrapy.Field()
    count = scrapy.Field()
    url = scrapy.Field()
    subcategory = scrapy.Field()
Spider code:

import scrapy

from allegro3.items import Allegro3Item

linki = []

class AlegroSpider(scrapy.Spider):
    name = "AlegroSpider"
    allowed_domains = ["allegro.pl"]
    #start_urls = ["http://allegro.pl/samochody-osobowe-4029?buyUsed=1"]
    #start_urls = ["http://allegro.pl/osobowe-buick-18092?buyUsed=1"]
    #start_urls = ["http://allegro.pl/osobowe-alfa-romeo-4030?buyUsed=1"]
    start_urls = ['http://allegro.pl/osobowe-volkswagen-4055?buyUsed=1']

    def parse(self, response):
        global linki
        if response.url not in linki:
            # add the link to the visited list
            linki.append(response.url)
            # if it is not a dead end
            if len(response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li').extract()) == \
               len(response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li/a').extract()):
                for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
                    # create the item
                    la = Allegro3Item()
                    link = de.xpath('a/@href').extract()
                    # populate the item
                    la['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
                    la['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
                    la['url'] = response.urljoin(link[0]).encode('utf-8')
                    la['subcategory'] = []
                    # go to the subnodes
                    if la['url'] is not None:
                        if la['url'] not in linki:
                            linki.append(la['url'])
                            # set up the request
                            request = scrapy.Request(la['url'], callback=self.SearchFurther)
                            request.meta['la'] = la
                            yield request
    def SearchFurther(self, response):
        global linki
        # get the item from the request metadata
        la = response.meta['la']
        # check that this is not a dead end (fewer categories than links)
        if len(response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li').extract()) == \
           len(response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li/a').extract()):
            # go further
            for de in response.xpath('//*[@id="sidebar-categories"]/div/nav/ul/li'):
                # populate the subnode
                la2 = Allegro3Item()
                la2['name'] = de.xpath('a/span/span/text()').extract()[0].encode('utf-8')
                la2['count'] = de.xpath('span/text()').extract()[0].encode('utf-8')
                link = de.xpath('a/@href').extract()
                la2['url'] = response.urljoin(link[0]).encode('utf-8')
                la2['subcategory'] = []
                # if there is a subnode, keep looping
                if la2['url'] is not None:
                    if la2['url'] not in linki:
                        linki.append(la2['url'])
                        # make a recursive request to get the sub/sub nodes
                        request = scrapy.Request(la2['url'], callback=self.SearchFurther)
                        request.meta['la'] = la2
                        # attach the request as a subcategory
                        la2['subcategory'].append(request)
                la['subcategory'].append(la2)
        else:
            la['subcategory'] = []
        yield la
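For reference, I run the spider and export the scraped items to JSON with something like:

scrapy crawl AlegroSpider -o output.json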
Example output:

{"count": 509, "url": "http://allegro.pl/volkswagen-transporter-83759?buyUsed=1", "name": "Transporter",
 "subcategory": [
   {"count": "(6)", "url": "http://allegro.pl/transporter-t2-83761?buyUsed=1", "name": " T2 ",
    "subcategory": ["<Request GET http://allegro.pl/transporter-t2-83761?buyUsed=1>"]},
   {"count": "(14)", "url": "http://allegro.pl/transporter-t3-83762?buyUsed=1", "name": " T3 ",
    "subcategory": ["<Request GET http://allegro.pl/transporter-t3-83762?buyUsed=1>"]},
   {"count": "(231)", "url": "http://allegro.pl/transporter-t4-83763?buyUsed=1", "name": " T4 ",
    "subcategory": ["<Request GET http://allegro.pl/transporter-t4-83763?buyUsed=1>"]},
   {"count": "(256)", "url": "http://allegro.pl/transporter-t5-83764?buyUsed=1", "name": " T5 ",
    "subcategory": ["<Request GET http://allegro.pl/transporter-t5-83764?buyUsed=1>"]}]},
So I understand that everything goes well up to the moment of the recursive request (where I set the SearchFurther callback and want to use it again to scrape the subcategory), because instead of the subcategory's data the yielded item (the final yield la at the end of SearchFurther) contains the request object itself:

"subcategory": ["<Request GET http://allegro.pl/transporter-t2-83761?buyUsed=1>"]}

If you can see what I am doing wrong, I would be grateful for your help.
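My guess (not verified) is that a Request stored inside an item field is never sent to the Scrapy engine, so only its repr() ends up in the exported JSON; a request only runs when it is yielded from a callback. Below is an untested sketch of how I imagine the tree could be assembled instead: every subcategory request is yielded, a shared counter tracks how many responses are still outstanding, and the single root item is yielded only once the counter drops to zero. The spider name AlegroTreeSpider, the methods parse_node/descend and the state dict are names I made up, not Scrapy API.

import scrapy

from allegro3.items import Allegro3Item

CATEGORY_LI = '//*[@id="sidebar-categories"]/div/nav/ul/li'

class AlegroTreeSpider(scrapy.Spider):
    # Hypothetical rewrite: collect the whole tree, yield one root item.
    name = "AlegroTreeSpider"
    allowed_domains = ["allegro.pl"]
    start_urls = ['http://allegro.pl/osobowe-volkswagen-4055?buyUsed=1']

    def parse(self, response):
        root = Allegro3Item(name='root', count='', url=response.url,
                            subcategory=[])
        # shared bookkeeping: how many responses are still outstanding
        state = {'pending': 1, 'root': root}
        for request in self.descend(response, root, state):
            yield request
        state['pending'] -= 1          # the start page itself is done
        if state['pending'] == 0:      # no subcategories at all
            yield root

    def parse_node(self, response):
        state = response.meta['state']
        for request in self.descend(response, response.meta['item'], state):
            yield request
        state['pending'] -= 1          # this branch is finished
        if state['pending'] == 0:      # this was the last outstanding response
            yield state['root']

    def descend(self, response, item, state):
        # build a child item per sidebar entry and yield a Request for it
        for de in response.xpath(CATEGORY_LI):
            link = de.xpath('a/@href').extract_first()
            if link is None:           # entry without a link: leaf, stop here
                continue
            child = Allegro3Item()
            child['name'] = de.xpath('a/span/span/text()').extract_first()
            child['count'] = de.xpath('span/text()').extract_first()
            child['url'] = response.urljoin(link)
            child['subcategory'] = []
            item['subcategory'].append(child)
            state['pending'] += 1
            # yield the Request to the engine -- appending it to the item,
            # as in my code above, only stores its repr()
            yield scrapy.Request(child['url'], callback=self.parse_node,
                                 meta={'item': child, 'state': state})

Note that this counting scheme assumes every yielded request actually comes back: a failed download or Scrapy's duplicate filter would leave the counter above zero forever, so real code would need an errback and/or dont_filter handling.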