Create one output CSV file for each start_url

Asked: 2018-06-05 12:51:18

Tags: python web-scraping web-crawler scrapy-spider

I am new to Scrapy. I am extracting data from www.goodsearch.com.


The following code works fine without any errors:

import scrapy


class GoodsearchSpider(scrapy.Spider):
    name = 'goodsearch'
    # domain only; putting a full URL here makes Scrapy's offsite filter log a warning
    allowed_domains = ['goodsearch.com']
    start_urls = ['http://www.goodsearch.com/coupons/macys/']
    #start_urls = ['https://www.goodsearch.com/coupons/shutterfly']

    def parse(self, response):
        listings = response.xpath('//*[@id="main"]/div[1]/ul/li')
        for listing in listings:
            coupon_description = listing.xpath('.//span[@class="title"]/text()').extract_first()
            coupon_discount1 = listing.xpath('.//div[@class="top"]/text()').extract_first()
            coupon_discount2 = listing.xpath('.//div[@class="bottom"]/text()').extract_first()
            coupon_type = listing.xpath('.//div[@class="title"]/text()').extract_first()
            coupon_expire_data = listing.xpath('.//p/text()').extract_first()
            coupon_code = listing.xpath('.//div[1]/div[4]/span[1]/text()').extract_first()
            coupon_used_times = listing.xpath('.//span[@class="click-count"]/text()').extract_first()

            # fall back to empty strings when either half of the discount is missing
            if coupon_discount1 is None or coupon_discount2 is None:
                coupon_discount1 = ""
                coupon_discount2 = ""
            coupon_discount = coupon_discount1 + coupon_discount2

            yield {'Coupon Description': coupon_description,
                   'Coupon Discount': coupon_discount,
                   'Coupon Type': coupon_type,
                   'Coupon Expire Data': coupon_expire_data,
                   'Coupon Code': coupon_code,
                   'Coupon Used Times': coupon_used_times,
                   }

With a single start_url it works fine, as in the code above. Instead, I want to read the links from an input CSV file.


Input CSV file (goodsearch_inputfile.csv):

link,store_name
https://www.goodsearch.com/coupons/amazon,Amazon
https://www.goodsearch.com/coupons/target,Target
https://www.goodsearch.com/coupons/bestbuy,BestBuy

For each link we have to generate a separate output CSV file, which means three output files in total. Can you help me with this?


I added the following code, but it did not help:

    with open("goodsearch/input_file/goodsearch_inputfile.csv", "r") as links:
        next(links)  # skip the header row
        for link in links:
            url, name = link.strip().split(',')  # the file is comma-delimited
            start_urls = [url.strip()]
            fname = name
            print('----------------------------------')
            print('name: {}, start urls: {}'.format(fname, start_urls))
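For reference, a sketch of reading that input file with the standard library's `csv` module, which handles the header row and the comma delimiter directly (the file path and sample contents are the ones from the question; in a real spider these pairs would typically feed `start_requests()`):

```python
import csv
import io

def read_input_links(f):
    """Yield (url, store_name) pairs from an open input file."""
    for row in csv.DictReader(f):
        yield row['link'].strip(), row['store_name'].strip()

# demo with the sample contents shown above (a real run would use
# open("goodsearch/input_file/goodsearch_inputfile.csv") instead)
sample = io.StringIO(
    "link,store_name\n"
    "https://www.goodsearch.com/coupons/amazon,Amazon\n"
    "https://www.goodsearch.com/coupons/target,Target\n"
    "https://www.goodsearch.com/coupons/bestbuy,BestBuy\n"
)
links = list(read_input_links(sample))
```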

1 Answer:

Answer 0 (score: 0):

Why not load the csv file into a numpy ndarray instead of using split on a plain file? You should take advantage of the csv file's organized structure.
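Whichever way the file is loaded, the "one output file per link" requirement can be sketched by keying a `csv.DictWriter` on the store name. This is a minimal, self-contained illustration (the helper name, fieldnames, and output paths are hypothetical; in a Scrapy project this logic would usually live in an item pipeline rather than a standalone function):

```python
import csv
import os
import tempfile

def write_items_per_store(items, out_dir):
    """items: iterable of (store_name, item_dict); writes one CSV per store."""
    fieldnames = ['Coupon Description', 'Coupon Discount', 'Coupon Code']
    writers, files = {}, {}
    try:
        for store, item in items:
            if store not in writers:
                # open a fresh output file the first time a store appears
                path = os.path.join(out_dir, store + '.csv')
                files[store] = open(path, 'w', newline='')
                writers[store] = csv.DictWriter(files[store], fieldnames=fieldnames)
                writers[store].writeheader()
            writers[store].writerow(item)
    finally:
        for f in files.values():
            f.close()

# demo with two made-up items from two different stores
out_dir = tempfile.mkdtemp()
write_items_per_store(
    [('Amazon', {'Coupon Description': '10% off', 'Coupon Discount': '10%',
                 'Coupon Code': 'SAVE10'}),
     ('Target', {'Coupon Description': 'Free shipping', 'Coupon Discount': '',
                 'Coupon Code': ''})],
    out_dir,
)
```

Each distinct store name ends up with its own file (here `Amazon.csv` and `Target.csv`), which matches the three-files-for-three-links requirement in the question.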