I am new to Scrapy. I am extracting data from www.goodsearch.com.
The following code works fine, without any errors:
import scrapy


class GoodsearchSpider(scrapy.Spider):
    name = 'goodsearch'
    allowed_domains = ['www.goodsearch.com/coupons/macys']
    start_urls = ['http://www.goodsearch.com/coupons/macys/']
    #start_urls = ['https://www.goodsearch.com/coupons/shutterfly']

    def parse(self, response):
        listings = response.xpath('//*[@id="main"]/div[1]/ul/li')
        for listing in listings:
            coupon_description = listing.xpath('.//span[@class="title"]/text()').extract_first()
            coupon_discount1 = listing.xpath('.//div[@class="top"]/text()').extract_first()
            coupon_discount2 = listing.xpath('.//div[@class="bottom"]/text()').extract_first()
            coupon_type = listing.xpath('.//div[@class="title"]/text()').extract_first()
            coupon_expire_data = listing.xpath('.//p/text()').extract_first()
            coupon_code = listing.xpath('.//div[1]/div[4]/span[1]/text()').extract_first()
            coupon_used_times = listing.xpath('.//span[@class="click-count"]/text()').extract_first()

            # If either half of the discount is missing, fall back to empty
            # strings so the concatenation below cannot raise a TypeError
            if coupon_discount1 is None or coupon_discount2 is None:
                coupon_discount1 = ""
                coupon_discount2 = ""
            coupon_discount = coupon_discount1 + coupon_discount2

            yield {'Coupon Description': coupon_description,
                   'Coupon Discount': coupon_discount,
                   'Coupon Type': coupon_type,
                   'Coupon Expire Data': coupon_expire_data,
                   'Coupon Code': coupon_code,
                   'Coupon Used Times': coupon_used_times,
                   }
If I pass a single start_url, it works fine, just like the code above. Instead, I want to take the links from an input CSV file.
Input CSV file (goodsearch_inputfile.csv):
link,store_name
https://www.goodsearch.com/coupons/amazon,Amazon
https://www.goodsearch.com/coupons/target,Target
https://www.goodsearch.com/coupons/bestbuy,BestBuy
For each link we have to generate a separate CSV output file, which means three output files here. Can you help me?
I added the following code, but it is of no use so far:
with open("goodsearch/input_file/goodsearch_inputfile.csv", "r") as links:
    next(links)  # skip the 'link,store_name' header row
    for link in links:
        url, name = link.strip().split(',')  # the input file is comma-separated, not '|'-separated
        start_urls = [url.strip()]
        fname = name
        print('----------------------------------')
        print('name: {}, start urls: {}'.format(fname, start_urls))
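For clarity, this is roughly how I imagine that loop plugging into the spider: override start_requests, read the CSV there, and carry the store name along with each request. This is only a rough sketch of my intent, not working code (the store_name meta key is just my guess at how to pass the name through to parse):

import csv
import scrapy


class GoodsearchSpider(scrapy.Spider):
    name = 'goodsearch'

    def start_requests(self):
        # Schedule one request per (link, store_name) row of the input file
        with open("goodsearch/input_file/goodsearch_inputfile.csv", newline="") as f:
            reader = csv.DictReader(f)  # picks up the 'link,store_name' header
            for row in reader:
                yield scrapy.Request(row["link"],
                                     callback=self.parse,
                                     meta={"store_name": row["store_name"]})

    def parse(self, response):
        store_name = response.meta["store_name"]
        # ... same extraction as in the working spider above, with store_name
        # available for routing items into a per-store output file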
Answer 0 (score: 0)
Why not load the CSV file into a NumPy ndarray instead of using split on a regular file? You should take advantage of the CSV file's organized structure.
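A minimal sketch of that idea, assuming the comma-separated input file shown in the question (np.genfromtxt with dtype=str keeps the URLs and names as text, and skip_header=1 drops the 'link,store_name' header row):

import numpy as np

# Load the whole CSV into a 2-D array of strings, skipping the header row
data = np.genfromtxt("goodsearch/input_file/goodsearch_inputfile.csv",
                     delimiter=",", dtype=str, skip_header=1)

for url, store_name in data:
    print('name: {}, start urls: {}'.format(store_name, [url]))

Each row then gives you a url/name pair that can seed the spider the same way as the csv-based sketch in the question.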