I want to scrape this website: https://www.dhgate.com/wholesale/electronics-robots/c103032.html
I have built the following Scrapy spider:
import scrapy
from urllib.parse import urljoin


class DhgateSpider(scrapy.Spider):
    name = 'dhgate'
    allowed_domains = ['dhgate.com']
    start_urls = ['https://www.dhgate.com/wholesale/electronics-robots/c103032.html']

    def parse(self, response):
        # Each query returns one flat list for the whole page.
        Product = response.xpath('//*[@class="pro-title"]/a/@title').extract()
        Price = response.xpath('//*[@class="price"]/span/text()').extract()
        Customer_review = response.xpath('//*[@class="reviewnum"]/span/text()').extract()
        Seller = response.xpath('//*[@class="seller"]/a/text()').extract()
        Feedback = response.xpath('//*[@class="feedback"]/span/text()').extract()
        # zip pairs the lists by position and stops at the shortest one.
        for item in zip(Product, Price, Customer_review, Seller, Feedback):
            scraped_info = {
                'Product': item[0],
                'Price': item[1],
                'Customer_review': item[2],
                'Seller': item[3],
                'Feedback': item[4],
            }
            yield scraped_info
        # Follow the pagination link, if there is one.
        next_page_url = response.xpath('//*[@class="next"]/@href').extract_first()
        if next_page_url:
            next_page_url = urljoin('https:', next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
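The problem is that not every container has a customer review or a feedback item. As a result, the spider only scrapes items that have a complete set of product, price, customer review, seller and feedback values. I want to scrape all containers, and where there is no customer review I want to add an empty value. How can I do that? Thanks.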
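The behaviour described above can be reproduced without Scrapy. The following is a minimal sketch with made-up values (the product names and review counts are hypothetical, not taken from the site): when one product has no review, the page-wide review list is shorter than the product list, so `zip` pairs values with the wrong product and drops the trailing one.

```python
# Hypothetical page-wide lists: "Robot B" has no review element,
# so the review query returns only two values for three products.
products = ["Robot A", "Robot B", "Robot C"]
reviews = ["12", "7"]  # these belong to Robot A and Robot C

# zip pairs by position and stops at the shortest input.
rows = list(zip(products, reviews))

for product, review in rows:
    print(product, review)
```

Here "Robot B" is paired with the review count that belongs to "Robot C", and "Robot C" is not emitted at all.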
Answer 0 (score: 1)
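Don't use zip. It pairs the page-wide lists by position and stops at the shortest one, so a product with a missing field misaligns the values or gets dropped. Instead, loop over each product container and extract the fields relative to that node: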
def parse(self, response):
    for product_node in response.xpath('//div[@id="proList"]/div[contains(@class, "listitem")]'):
        # The leading "." keeps each query inside the current product node.
        Product = product_node.xpath('.//*[@class="pro-title"]/a/@title').extract_first()
        Price = product_node.xpath('.//*[@class="price"]/span/text()').extract_first()
        Customer_review = product_node.xpath('.//*[@class="reviewnum"]/span/text()').extract_first()
        Seller = product_node.xpath('.//*[@class="seller"]/a/text()').extract_first()
        Feedback = product_node.xpath('.//*[@class="feedback"]/span/text()').extract_first()
        scraped_info = {
            'Product': Product,
            'Price': Price,
            'Customer_review': Customer_review,
            'Seller': Seller,
            'Feedback': Feedback,
        }
        yield scraped_info
    next_page_url = response.xpath('//*[@class="next"]/@href').extract_first()
    if next_page_url:
        next_page_url = urljoin('https:', next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)
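With the per-node loop above, `extract_first()` returns `None` for a field that is missing. If an empty string is preferred, Scrapy's `extract_first` accepts a `default` argument, e.g. `extract_first(default='')`. The sketch below illustrates the same per-node pattern using only the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, so that it runs without Scrapy installed. The markup is a simplified, hypothetical version of the page, and `parse_items` is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second item has no review count.
SAMPLE = """
<div id="proList">
  <div class="listitem">
    <h3 class="pro-title"><a title="Robot A">Robot A</a></h3>
    <span class="reviewnum"><span>12</span></span>
  </div>
  <div class="listitem">
    <h3 class="pro-title"><a title="Robot B">Robot B</a></h3>
  </div>
</div>
"""


def parse_items(markup):
    """Return one dict per product node, with '' for missing fields."""
    root = ET.fromstring(markup)
    items = []
    # One iteration per product container, as in the answer's loop.
    for node in root.findall(".//div[@class='listitem']"):
        title = node.find(".//*[@class='pro-title']/a")
        items.append({
            "Product": title.get("title") if title is not None else "",
            # findtext returns the default when no element matches.
            "Customer_review": node.findtext(
                ".//*[@class='reviewnum']/span", default=""
            ),
        })
    return items


items = parse_items(SAMPLE)
for item in items:
    print(item)
```

Both products are emitted, and the one without a review gets an empty string instead of being skipped.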