Question

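I am close to finishing a crawler with Scrapy. The crawler should collect and extract data from page1, including links. It then yields new requests (one for every link collected on page1) to page2, where it again collects and extracts all the required data and sends requests (again, for all links) to page3, collects the data from page3, and finishes. I have 3 output files, and there are millions of such pages (page1, page2, page3). Right now, everything works when I use a return statement in the method for a single URL. Everything is coded correctly, and it worked well until I started yielding requests. I know the error persists, but I am not sure when it occurs or why. This is the error: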

ValueError: Missing scheme in request url:

This happens because I am making a request with no URL; the whole list of fields is empty:

  2014-03-31 18:32:47+0200 [ah-h] WARNING: Dropped:
    {'gif_url': '',
     'link_referal': '',
     'model_id': '',
     'parts_name': '',
     'parts_number': ''}

So my code works fine as long as I don't yield a new Request(). Here is my code:

def parse_part_list(self,response):
    hxs = HtmlXPathSelector(response)
    s = hxs.select('//tr')
    items = []
    for part in s:
        item = PartsItem()
        item['link_referal'] = "".join(part.select('td[@class="sel_sec_sub_section_even_col_2"]/a/@href | td[@class="sel_sec_sub_section_odd_col_2"]/a/@href').extract())            
        item['model_id'] = "".join(part.select('td[@class="sel_sec_sub_section_even_col_2"]/a/@href | td[@class="sel_sec_sub_section_odd_col_2"]/a/@href').extract())
        item['parts_number'] = "".join(part.select('td[@class="sel_sec_sub_section_even_col_2"]/a/@href | td[@class="sel_sec_sub_section_odd_col_2"]/a/@href').extract())
        item['parts_name'] = "".join(part.select('td[@class="sel_sec_sub_section_even_col_2"]/a/text() | td[@class="sel_sec_sub_section_odd_col_2"]/a/text()').extract())
        item['gif_url'] = "".join(part.select('td/@onmouseover').extract())
        yield item
        yield Request(item['link_referal'], callback=self.parse_frames)
        #items.append(item)
    #return items

Now, if I comment out:

#yield Request(item['link_referal'], callback=self.parse_frames)

everything works fine: the data is scraped and written to my CSV file, with the blank rows removed, of course.
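One possible reading of the symptoms above: `//tr` matches every table row, so rows without a matching link cell produce `link_referal == ''`, and Scrapy's `Request` raises "Missing scheme in request url" for a URL with no scheme. A minimal sketch of a guard, assuming the hrefs may also be relative; the helper name `build_follow_url` is hypothetical and not part of the original spider:

```python
from urllib.parse import urljoin


def build_follow_url(page_url, href):
    """Return an absolute URL for href, or None when href is blank."""
    href = (href or "").strip()
    if not href:
        # Rows without a link yield '' here; skip them instead of
        # building a request with an empty URL.
        return None
    # Resolve relative hrefs against the URL of the page they came from.
    return urljoin(page_url, href)


# Hypothetical use inside parse_part_list (names follow the code above):
#     url = build_follow_url(response.url, item['link_referal'])
#     if url is not None:
#         yield Request(url, callback=self.parse_frames)
```

With a guard like this, blank rows would still reach the pipeline and be dropped there, but no request would be created for them.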

Empty items are filtered out in pipelines.py. This is the relevant part of the code:

if isinstance(item, items.PartsItem):
        #check if expanded data is blank line
        if not(all(item.values())):
            raise DropItem()
        else:
            item['model_id'] = "".join(get_par_from_url(item['model_id'],'a'))
            item['parts_number'] = "".join(get_par_from_url(item['parts_number'],'b'))
            self.partsCsv.writerow([item['link_referal'],item['model_id'],item['parts_number'],item['parts_name'],item['gif_url']])
            return item
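A side note on the check above: `not all(item.values())` is true as soon as any single field is empty, not only when the whole item is blank. A small sketch with plain dicts standing in for `PartsItem` (an assumption made to keep the example self-contained; the function names are hypothetical):

```python
def has_empty_field(item):
    """True when at least one field is empty (the check used above)."""
    return not all(item.values())


def is_fully_blank(item):
    """True only when every field is empty."""
    return not any(item.values())


blank = {'gif_url': '', 'link_referal': '', 'model_id': ''}
partial = {'gif_url': '', 'link_referal': 'frames.php?a=1', 'model_id': '1'}

# The blank row is caught by both checks.
print(has_empty_field(blank), is_fully_blank(blank))
# The partial row is caught only by the stricter all() check.
print(has_empty_field(partial), is_fully_blank(partial))
```

Whether partially filled rows should be kept or dropped depends on the data; the sketch only shows that the two checks differ.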

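As the code shows, I use DropItem() to drop empty items (blank rows in my CSV file), but for some strange reason this part of the code drops all the items sent to my pipeline once I start yielding new requests. Any suggestions?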

Scrapy: empty items causing blank rows in CSV

0 answers: