Saving a Scrapy spider's output to CSV

Date: 2018-02-07 14:02:00

Tags: python selenium scrapy

I'm trying to scrape a website and save the information, and I currently have two problems.

First, when I click a button with Selenium (in this case a "load more results" button), it doesn't keep clicking until the end, and I can't figure out why.

The other problem is that nothing is being saved to the CSV file in the parse_article function.

Here is my code:

import scrapy
from selenium import webdriver
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from selenium.webdriver.common.by import By
import csv


class ProductSpider(scrapy.Spider):
    name = "Southwestern"
    allowed_domains = ['www.reuters.com/']
    start_urls = [
        'https://www.reuters.com/search/news?blob=National+Health+Investors%2c+Inc.']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            next = self.driver.find_element_by_class_name(
                "search-result-more-txt")
            # next = self.driver.find_element_by_xpath('//*[@id="content"]/section[2]/div/div[1]/div[4]/div/div[4]/div[1]')
            # maybe do it with this
            # button2 = driver.find_element_by_xpath("//*[contains(text(), 'Super')]")
            try:
                next.click()

            # get the data and write it to scrapy items
            except:
                break

        SET_SELECTOR = '.search-result-content'
        for articles in self.driver.find_elements(By.CSS_SELECTOR, SET_SELECTOR):
            item = {}
            # get the date
            item["date"] = articles.find_element_by_css_selector('h5').text
            # title
            item["title"] = articles.find_element_by_css_selector('h3 a').text

            item["link"] = articles.find_element_by_css_selector(
                'a').get_attribute('href')

            print(item["link"])

            yield scrapy.Request(url=item["link"], callback=self.parse_article, meta={'item': item})
        self.driver.close()

    def parse_article(self, response):
        item = response.meta['item']

        texts = response.xpath(
            "//div[contains(@class, 'StandardArticleBody')]//text()").extract()
        if "National Health Investors" in texts:
            item = response.meta['item']
            row = [item["date"], item["title"], item["link"]]
            with open('Websites.csv', 'w') as outcsv:
                writer = csv.writer(outcsv)
                writer.writerow(row)

2 Answers:

Answer 0 (score: 1)

  1. Try waiting a short moment after each click so the new data can load. I suspect your script sometimes searches for the button before the new data and the new button have been rendered.
  2. Try using an implicit or an explicit wait (see the click-loop sketch after this list):

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # An implicit wait tells WebDriver to poll the DOM for a certain amount
    # of time when trying to find any element (or elements) not immediately
    # available.
    self.driver.implicitly_wait(<time in seconds>)

    # An explicit wait is code you define to wait for a certain condition to
    # occur before proceeding further in the code.
    wait = WebDriverWait(self.driver, <time in seconds>)
    wait.until(EC.presence_of_element_located((By.XPATH, button_xpath)))
    
  3. 'w' opens the file for writing only and erases any existing file with the same name. Try the 'a' (append) mode instead. That said, I would recommend using a pipeline: link
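
For illustration, here is a minimal sketch of how the question's click loop could combine the explicit wait with the click; the 10-second timeout is an arbitrary example value:

    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def parse(self, response):
        self.driver.get(response.url)
        wait = WebDriverWait(self.driver, 10)  # example timeout
        while True:
            try:
                # Wait until the button is clickable, not merely present in the DOM.
                next_button = wait.until(EC.element_to_be_clickable(
                    (By.CLASS_NAME, "search-result-more-txt")))
                next_button.click()
            except TimeoutException:
                break  # button never became clickable: assume no more results
        # ... scrape the loaded results as in the question ...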

Answer 1 (score: -1)

The first problem looks like the button hasn't appeared yet. Maybe this can help you.

One more thing: try closing the driver when Scrapy shuts down. Perhaps this can help you.
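
For reference, a minimal sketch of that idea: Scrapy calls a spider's closed() method when the spider shuts down, which is a natural place to quit the driver:

    class ProductSpider(scrapy.Spider):
        # ... rest of the spider as in the question ...

        def closed(self, reason):
            # Called automatically by Scrapy when the spider closes,
            # whether the crawl finished normally or was interrupted.
            self.driver.quit()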

For the second problem: it looks like you open the file and write to it multiple times, which is not good because you will overwrite the existing content. Even using the 'a' flag, e.g. open(FILE_NAME, 'a'), is not good practice in Scrapy.

Try creating an Item, filling it in, and then saving the items to a CSV file using the Pipelines mechanism, similar to here.
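
For illustration, a minimal sketch of such a pipeline; the file name and columns are taken from the question, while the class name is a placeholder:

    # pipelines.py
    import csv

    class CsvWriterPipeline:
        def open_spider(self, spider):
            # Open the file once per crawl instead of once per item.
            self.file = open('Websites.csv', 'w', newline='')
            self.writer = csv.writer(self.file)
            self.writer.writerow(['date', 'title', 'link'])  # header row

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            self.writer.writerow([item['date'], item['title'], item['link']])
            return item

To enable it, register the pipeline in settings.py (the module path depends on your project name):

    # settings.py
    ITEM_PIPELINES = {'myproject.pipelines.CsvWriterPipeline': 300}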