How can I return NaN for websites with no scraped information?

Time: 2017-04-12 19:01:40

Tags: python python-3.x pandas scrapy

How can I return NaN for URLs that have no links matching `.//*[@id='object']//tbody//tr//td//span//a[2]`? I tried:

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    if not links:
        item = ToyItem()
        item['link'] = 'NaN'
        item['name'] = response.url
        return item

    for links in links:
        item = ToyItem()
        item['link'] = links.xpath('@href').extract_first()
        item['name'] = response.url  # <-- see here
    yield item

    list_of_dics = []
    list_of_dics.append(item)
    df = pd.DataFrame(list_of_dics)
    print(df)
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

However, instead of returning (*):

'link1.com'   'NaN'
'link2.com'   'NaN'
'link3.com'   'extracted3.link.com'

I got only:

'link3.com'   'extracted3.link.com'

How can I return (*)?

1 answer:

Answer 0 (score: 1)

You can rework this to use a Scrapy item pipeline:

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['link1','link2','link3']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        if not links:
            item = ToyItem()
            item['link'] = 'NaN'
            item['name'] = response.url
            yield item
        else:
            for link in links:
                item = ToyItem()
                item['link'] = link.xpath('@href').extract_first()
                item['name'] = response.url  # <-- see here
                yield item
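
Stripped of the Scrapy specifics, the branching in `parse` amounts to the sketch below: emit one item per extracted link, or a single `'NaN'` item when nothing matched. The function name and the plain dicts are illustrative stand-ins for `ToyItem`, not part of the answer's code:

```python
def build_items(url, hrefs):
    """Return one item dict per link; a single NaN item when no links were found."""
    if not hrefs:
        return [{'link': 'NaN', 'name': url}]
    return [{'link': h, 'name': url} for h in hrefs]

print(build_items('link1.com', []))
print(build_items('link3.com', ['extracted3.link.com']))
```

This makes the key fix visible: every branch produces an item for the URL, so pages with no matches still contribute a `'NaN'` row.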

Now in pipelines.py:

import pandas as pd

class PandasPipeline:

    def open_spider(self, spider):
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        return item

    def close_spider(self, spider):
        df = pd.DataFrame(self.data)
        print('saving dataframe')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

And in settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.PandasPipeline': 900
}
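
Outside of a running crawl, the pipeline's lifecycle can be simulated with plain dicts to see the DataFrame it would produce at `close_spider` time (the item values below are illustrative, matching the (*) table from the question):

```python
import pandas as pd

# open_spider: start with an empty buffer
data = []

# process_item: each yielded item is appended as the crawl runs
for item in [
    {'link': 'NaN', 'name': 'link1.com'},
    {'link': 'NaN', 'name': 'link2.com'},
    {'link': 'extracted3.link.com', 'name': 'link3.com'},
]:
    data.append(item)

# close_spider: build one DataFrame from everything collected
df = pd.DataFrame(data)
print(df)
```

Because the DataFrame is built once, after all responses have been processed, the CSV contains every row, including the `'NaN'` rows for pages with no links, rather than only the last item as in the original `parse`.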