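I'm trying to scrape a dynamically loaded page from jet.com. The actual URL is:
https://jet.com/product/Pringles-Pizza-Potato-Crisps-55-Oz/b809dbfac8de4758b3234a82ff562fd5
I've included my spider below. The page loads in the Chrome browser, but the loader returns no results. In full honesty, I'm very new to Python and Scrapy and only started playing with Selenium this morning. Unfortunately, documentation on using Scrapy together with Selenium and item loaders is hit and miss, when you can find it at all. Any tips would help. I suspect this is an obvious mistake, but I can't see it.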
import scrapy
from scrapy.loader import ItemLoader
from JetScrape.items import ProductLoader, JetProduct
import datetime
from selenium import webdriver
import time


class JetSpider(scrapy.Spider):
    name = "jet"
    allowed_domains = ["jet.com"]
    with open("JetURL.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.br = webdriver.Chrome()

    def __del__(self):
        self.br.close()

    def parse(self, response):
        self.br.get(response.url)
        time.sleep(3)
        Today = datetime.datetime.now()
        jetload = ProductLoader(item=JetProduct(), selector=self.br.page_source)
        jetload.add_xpath("jetprice", "//span[@class='formatted-value']/text()")
        jetload.add_xpath("jettitle", "//h1[@class='name']/text()")
        jetload.add_value("jetLast_Updated", Today)
        yield jetload.load_item()
Answer 0 (score: 0)
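The `selector` argument of the ItemLoader expects a Scrapy Selector object, not the raw HTML string returned by `self.br.page_source`.
You can try the following modification: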
from scrapy import Selector
...
def parse(self, response):
    ...
    br_selector = Selector(text=self.br.page_source)
    jetload = ProductLoader(item=JetProduct(), selector=br_selector)
The complete spider code should look like this:
import datetime
import time

import scrapy
from selenium import webdriver
from scrapy import Selector
from scrapy.loader import ItemLoader

from JetScrape.items import ProductLoader, JetProduct


class JetSpider(scrapy.Spider):
    name = "jet"
    allowed_domains = ["jet.com"]
    # Read the start URLs from a file, one URL per line.
    with open("JetURL.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.br = webdriver.Chrome()

    def __del__(self):
        self.br.close()

    def parse(self, response):
        # Load the page in the browser so its JavaScript can run.
        self.br.get(response.url)
        time.sleep(3)
        Today = datetime.datetime.now()
        # Wrap the rendered HTML in a Selector so the loader can run XPath on it.
        br_selector = Selector(text=self.br.page_source)
        jetload = ProductLoader(item=JetProduct(), selector=br_selector)
        jetload.add_xpath("jetprice", "//span[@class='formatted-value']/text()")
        jetload.add_xpath("jettitle", "//h1[@class='name']/text()")
        jetload.add_value("jetLast_Updated", Today)
        yield jetload.load_item()
If this doesn't work, please post the contents of your items.py so that the full scraper can be reproduced.