I am using regex to pull data out of the script tags of several URLs. I have a csv file ('links.csv') that contains all the URLs I need to scrape. I managed to read the csv and store all the URLs in a variable named start_urls. My problem is that I need a way to read the URLs from start_urls one at a time and then run the rest of my code on each. When I execute the code in the terminal, it returns two errors:
1. for pvi_subtype_name,pathIndicator.depth_5,model_name in zip(source): ValueError: not enough values to unpack (expected 3, got 1)
2. source = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()[0] IndexError: list index out of range
Here are some examples of the URLs stored in my initial csv ('links.csv'):
"https://www.samsung.com/uk/smartphones/galaxy-note8/"
"https://www.samsung.com/uk/smartphones/galaxy-s8/"
"https://www.samsung.com/uk/smartphones/galaxy-s9/"
Here is my code:
import scrapy
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('links.csv', 'r') as csvf:
            for url in csvf:
                yield scrapy.Request(url.strip())

    def parse(self, response):
        source = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()[0]

        def get_values(parameter, script):
            return re.findall('%s = "(.*)"' % parameter, script)[0]

        with open('baza.csv', 'w') as csvfile:
            fieldnames = ['Category', 'Type', 'SK']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            for pvi_subtype_name, pathIndicator.depth_5, model_name in zip(source):
                writer.writerow({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
Answer (score: 1)
The S9 site is structured differently from the S8 site, so you will always get an error for it: COUNTRY_SHOP_STATUS is not found on the S9 page.
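That is also why extract()[0] raised IndexError: extract() returns an empty list when the XPath matches nothing, and indexing into an empty list fails. Besides the explicit guard used in the code below, Scrapy's extract_first() is an option, since it returns None instead of raising; a minimal sketch:

    # extract_first() returns None (rather than raising) when nothing matches
    source = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract_first()
    if source is not None:
        pass  # safe to run the regexes on the script text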
Using the csv writer directly like this is not straightforward: you overwrite your results repeatedly, because you open a new csv file for every product. If you really want to do it that way, open the csv file (and write the header) in start_requests and append to it from parse, but do take a look at item pipelines. I also removed the zip loop, because parse already operates at the lowest level, one response at a time.
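For reference, the original ValueError comes from zipping a single string: zip() over one iterable yields 1-tuples (here, single characters of the script text), which cannot be unpacked into three names. A quick illustration:

    source = 'some script text'
    print(list(zip(source))[:3])  # [('s',), ('o',), ('m',)] -- 1-tuples of characters
    # unpacking each 1-tuple into three names is what raised:
    # ValueError: not enough values to unpack (expected 3, got 1)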
import scrapy
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # write the header once, before any request is scheduled, so that
        # parse() only ever appends rows and never truncates the file
        with open('so_52069753_out.csv', 'w') as csvfile:
            fieldnames = ['Category', 'Type', 'SK']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
        with open('so_52069753.csv', 'r') as csvf:
            urlreader = csv.reader(csvf, delimiter=',', quotechar='"')
            for url in urlreader:
                if url[0] == "y":
                    yield scrapy.Request(url[1])

    def parse(self, response):
        def get_values(parameter, script):
            return re.findall('%s = "(.*)"' % parameter, script)[0]

        source_arr = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()
        if source_arr:
            source = source_arr[0]
            # alternatively, yield an item here and let an item pipeline do the writing:
            # yield {'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)}
            with open('so_52069753_out.csv', 'a') as csvfile:
                fieldnames = ['Category', 'Type', 'SK']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writerow({'Category': get_values("pvi_subtype_name", source), 'Type': get_values("pathIndicator.depth_5", source), 'SK': get_values("model_name", source)})
I also changed the input csv file (so_52069753.csv):
y,https://www.samsung.com/uk/smartphones/galaxy-note8/
y,https://www.samsung.com/uk/smartphones/galaxy-s8/
y,https://www.samsung.com/uk/smartphones/galaxy-s9/
This way you can configure, per URL, whether it should be processed.
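As suggested above, the more idiomatic route is to yield items from parse and let an item pipeline write the csv, so the output file is opened exactly once per crawl. A minimal sketch, with an assumed project name and pipeline class that are not part of the original code:

    # pipelines.py -- hypothetical CsvWriterPipeline; enable it in settings.py with
    # ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300}
    import csv

    class CsvWriterPipeline:
        def open_spider(self, spider):
            # open the output file once, when the spider starts
            self.file = open('baza.csv', 'w', newline='')
            self.writer = csv.DictWriter(self.file, fieldnames=['Category', 'Type', 'SK'])
            self.writer.writeheader()

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            # each dict yielded by parse() becomes one csv row
            self.writer.writerow(item)
            return item

With this in place, parse() shrinks to a single yield of the {'Category': ..., 'Type': ..., 'SK': ...} dict.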