I am generating a list of items from various index pages. I have a start_url and a list of XPath rules:
    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = EvolutionmItem()
            item['title'] = site.xpath('td/div[not(contains(., "Sticky:") or contains(.,"ANNOUNCEMENT"))]/a[contains(@id,"thread_title")]/text()').extract()
            item['url'] = site.xpath('td[contains(@id,"threadtitle")]/div/a[contains(@href,"http://forums.evolutionm.net/sale-engine-drivetrain-power/")]/@href').extract()
            item['poster'] = site.xpath('td[contains(@id,"threadtitle")]/div[@class="smallfont"]/span/text()').extract()
            item['status'] = site.xpath('td[contains(@id,"threadtitle")]/div/span[contains(@class,"highlight")]').extract()
            items.append(item)
        return items
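This code runs without errors and extracts exactly what I need. Now I want to visit each of those URLs and extract additional data from them.
What is the best way to do this? I can't seem to get request.meta to work.
Edit
Girish's solution is correct, but to make it work I had to make sure that my item['url'] is not empty before yielding the request: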
    for site in sites:
        item = EvolutionmItem()
        ...
        if item['url']:
            yield Request(item['url'][0], meta={'item': item}, callback=self.thread_parse)
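The empty check above can be factored into a small helper that also rejects values that are not absolute URLs. This is only a sketch: `first_valid_url` is a hypothetical name, not part of Scrapy, and it assumes the extracted values are plain strings like those returned by `.extract()`.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse


def first_valid_url(extracted: Iterable[str]) -> Optional[str]:
    """Return the first absolute http(s) URL in `extracted`, or None.

    Hypothetical helper (not a Scrapy API): `extracted` stands for the
    list of strings that an XPath `.extract()` call returns.
    """
    for candidate in extracted:
        candidate = candidate.strip()
        parsed = urlparse(candidate)
        # Empty strings and relative paths have no scheme or host.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return candidate
    return None
```

In the loop, `url = first_valid_url(item['url'])` followed by `if url:` would then replace the `item['url'][0]` indexing, so rows without a matching thread link are skipped instead of producing a request with no usable URL.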
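Answer 0 (score: 2)
You need to yield a Request object with the url, meta, and callback arguments. The callback can then read the item back from response.meta.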
    from scrapy.http import Request
    from scrapy.selector import Selector

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        for site in sites:
            item = EvolutionmItem()
            item['title'] = site.xpath('td/div[not(contains(., "Sticky:") or contains(.,"ANNOUNCEMENT"))]/a[contains(@id,"thread_title")]/text()').extract()
            # Join the extracted list into a single URL string for Request.
            item['url'] = u''.join(site.xpath('td[contains(@id,"threadtitle")]/div/a[contains(@href,"http://forums.evolutionm.net/sale-engine-drivetrain-power/")]/@href').extract())
            item['poster'] = site.xpath('td[contains(@id,"threadtitle")]/div[@class="smallfont"]/span/text()').extract()
            item['status'] = site.xpath('td[contains(@id,"threadtitle")]/div/span[contains(@class,"highlight")]').extract()
            # Attach the partially filled item to the request through meta.
            yield Request(url=item['url'], meta={'item': item}, callback=self.parse_additional_info)

    def parse_additional_info(self, response):
        # Recover the item that parse() attached to the request.
        item = response.meta['item']
        # extract additional info
        yield item
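To make the hand-off through meta easier to follow without a running crawler, here is a minimal standard-library sketch of the same flow. `FakeRequest`, `FakeResponse`, and `run_crawl` are hypothetical stand-ins written for this illustration, not Scrapy classes. They only mimic the fact that the meta dict set on a request is available on the response passed to its callback. The `body_length` field is a placeholder for whatever the real callback would extract.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: URL, callback, and meta dict."""
    url: str
    callback: Callable[..., Any]
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response; meta comes from the request."""
    url: str
    body: str
    meta: Dict[str, Any]


def parse(rows: List[Dict[str, str]]) -> Iterator[FakeRequest]:
    """Index-page step: build a partial item and attach it to the request."""
    for row in rows:
        item = {"title": row["title"], "url": row["url"]}
        if item["url"]:  # skip rows that have no thread link
            yield FakeRequest(item["url"], callback=parse_additional_info,
                              meta={"item": item})


def parse_additional_info(response: FakeResponse) -> Iterator[Dict[str, Any]]:
    """Detail-page step: recover the partial item from meta and finish it."""
    item = response.meta["item"]
    item["body_length"] = len(response.body)  # placeholder for real extraction
    yield item


def run_crawl(rows: List[Dict[str, str]], pages: Dict[str, str]) -> List[Dict[str, Any]]:
    """Toy scheduler: looks up each request's page and calls its callback."""
    results = []
    for request in parse(rows):
        response = FakeResponse(request.url, pages[request.url], request.meta)
        results.extend(request.callback(response))
    return results
```

The point of the sketch is that the callback never sees the local variable `item` from `parse`. The only way the partially filled item reaches the second step is through the meta dict carried by the request.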