Scrapy and Selenium load the page, but the item loader returns no results

Time: 2017-11-17 17:35:42

Tags: python selenium scrapy

I'm trying to scrape a dynamically loaded page from jet.com. The actual URL is:

https://jet.com/product/Pringles-Pizza-Potato-Crisps-55-Oz/b809dbfac8de4758b3234a82ff562fd5

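I've included my spider below. The page loads in the Chrome browser, but the loader returns no results. To be completely honest, I'm very new to Python and Scrapy, and I only started playing with Selenium this morning. Unfortunately, documentation on using Scrapy and Selenium together with item loaders is hit or miss, when you can find it at all. Any tips would help. I worry this may be an obvious mistake, but I can't spot it.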

import scrapy
from scrapy.loader import ItemLoader
from JetScrape.items import ProductLoader, JetProduct
import datetime
from selenium import webdriver
import time



class JetSpider(scrapy.Spider):
    name = "jet"
    allowed_domains = ["jet.com"]
    with open("JetURL.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def __init__(self):
        scrapy.Spider.__init__(self)
        self.br = webdriver.Chrome()

    def _del_(self):
        self.br.close()

    def parse(self, response):
        self.br.get(response.url)
        time.sleep(3)
        Today = datetime.datetime.now()
        jetload = ProductLoader(item=JetProduct(), selector=self.br.page_source)
        jetload.add_xpath("jetprice", "//span[@class='formatted-value']/text()")
        jetload.add_xpath("jettitle", "//h1[@class='name']/text()")
        jetload.add_value("jetLast_Updated", Today)
        yield jetload.load_item()

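1 answer:

Answer 0 (score: 0)

You should pass a Scrapy Selector object to the ItemLoader. In your spider, self.br.page_source is a plain string of HTML, not a Selector, so the loader has nothing to run the XPath expressions against.

You can try the following modification: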

from scrapy import Selector
...
def parse(self, response):
    ...
    br_selector = Selector(text = self.br.page_source)
    jetload = ProductLoader(item = JetProduct(), selector = br_selector)

The complete spider code should look like this:

import datetime
import time

import scrapy
from scrapy import Selector
from selenium import webdriver

from JetScrape.items import ProductLoader, JetProduct


class JetSpider(scrapy.Spider):
    name = "jet"
    allowed_domains = ["jet.com"]
    # Read the start URLs from a text file, one URL per line.
    with open("JetURL.txt", "rt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One Chrome instance is shared by every page this spider parses.
        self.br = webdriver.Chrome()

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes. The original _del_
        # was not a special method name, so it was never invoked.
        self.br.quit()

    def parse(self, response):
        # Load the page in the browser so the JavaScript-rendered content exists.
        self.br.get(response.url)
        time.sleep(3)  # fixed wait for the dynamic content to load
        today = datetime.datetime.now()
        # Wrap the rendered HTML in a Selector; page_source alone is a plain string.
        br_selector = Selector(text=self.br.page_source)
        jetload = ProductLoader(item=JetProduct(), selector=br_selector)
        jetload.add_xpath("jetprice", "//span[@class='formatted-value']/text()")
        jetload.add_xpath("jettitle", "//h1[@class='name']/text()")
        jetload.add_value("jetLast_Updated", today)
        yield jetload.load_item()

If this doesn't work, please post the contents of your items.py so the full scraper can be reproduced.