Trouble running a parser created using scrapy

Time: 2018-06-10 21:19:05

Tags: python python-3.x selenium web-scraping scrapy

I've written a scraper in Python Scrapy, combined with Selenium, to grab some titles from a website. The CSS selectors defined in my scraper are flawless. I want my scraper to keep clicking on the next page and parse the information embedded in each page. It does fine with the first page, but when it comes to the Selenium part, the scraper keeps clicking on the same link over and over again.

As this is my first time working with Selenium alongside Scrapy, I don't have any idea how to move along successfully. Any fix will be highly appreciated.

If I try it like this, it runs smoothly (there is nothing wrong with the selectors):

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)

        while True:
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"h1.faqsno-heading"))):
                name = elem.find_element_by_css_selector("div[id^='arrowex']").text
                print(name)

            try:
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break

But my intention is to make my script run this way:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # it keeps clicking on the same link over and over again

        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
        self.wait.until(EC.staleness_of(elem))


    def parse(self,response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url) #initiate the method to do the clicking
            except TimeoutException:
                break

These are the titles visible on the landing page (to let you know what I'm after):

INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST

I'm not after the data from that site, so any alternative approach other than what I've tried above is useless to me. My only intention is to find a solution related to the way I tried in my second approach.

5 Answers:

Answer 0 (score: 1)

If you need a pure Selenium solution:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx")

while True:
    for item in wait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div[id^='arrowex']"))):
        print(item.text)
    try:
        # the Next button carries the 'disabledImageButton' class on the last page
        driver.find_element_by_xpath("//input[@text='Next' and not(contains(@class, 'disabledImageButton'))]").click()
    except NoSuchElementException:
        break

Answer 1 (score: 1)

import scrapy
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from scrapy.crawler import CrawlerProcess

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

        link = 'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx'
        self.driver.get(link)  # load the page only once, up front

    def click_nextpage(self):        
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # no re-navigation here; the page was loaded once in __init__

        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
        self.wait.until(EC.staleness_of(elem))
        time.sleep(4)  # crude extra wait for the new page to render

    def parse(self,response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage() #initiate the method to do the clicking
            except TimeoutException:
                break

process = CrawlerProcess()

process.crawl(IncomeTaxSpider)
process.start()
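
Because of the CrawlerProcess block at the end, this version runs as a plain script, e.g. (assuming the file is saved as taxspider.py):

python taxspider.py

With the OP's original project layout it would be scrapy crawl taxspider instead.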

Answer 2 (score: 1)

Your initial code was almost correct, but one key piece was missing: you kept using the same response object all along. The response object needs to be rebuilt from the latest page source.

Also, you were navigating to the link again and again inside click_nextpage, which reset the site back to page 1 every time. That is why you only ever got pages 1 and 2 (at most). The URL needs to be loaded just once, in the parse stage, and from then on only the next-page click should happen.
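
The crucial line is the one that rebuilds the Scrapy response from Selenium's current page source after every click:

response = response.replace(body=self.driver.page_source)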

The final code below works fine:

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link):
        # self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # the self.driver.get(link) call above was the culprit: it reloaded page 1 on every call

        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))


    def parse(self, response):
        self.driver.get(response.url)

        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url) #initiate the method to do the clicking
                # rebuild the response from the browser's current page source
                response = response.replace(body=self.driver.page_source)
            except TimeoutException:
                break

After these changes, it works nicely.

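One detail none of these snippets handle is shutting the browser down when the crawl ends. A minimal sketch, assuming the spider structure above, using Scrapy's closed() hook:

    def closed(self, reason):
        # called by Scrapy when the spider finishes, whatever the reason;
        # quit the browser so no stray driver processes linger
        self.driver.quit()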

Answer 3 (score: 0)

Whenever the page gets loaded via the "next page" arrow (using Selenium), it somehow gets reset back to page 1. Not sure of the reason (maybe the site's JavaScript). So I changed the approach: navigate by entering the desired page number into the page-number input field and pressing ENTER.

Here is the modified code. Hope it works for you.

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"
    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]
    def __init__(self):
        self.driver = webdriver.Firefox()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link, number):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # type the desired page number into the pager's input box and press ENTER
        inputElement = self.driver.find_element_by_xpath("//input[@id='ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_txtPageNumber']")
        inputElement.clear()
        inputElement.send_keys(number)
        inputElement.send_keys(Keys.ENTER)
        self.wait.until(EC.staleness_of(elem))


    def parse(self,response):
        number = 1
        while number < 10412: #Website shows it has 10411 pages.
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}
                print (name)

            try:
                number += 1
                self.click_nextpage(response.url, number) #initiate the method to do the clicking
            except TimeoutException:
                break
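
The hardcoded WebPart id in that XPath is brittle. A hedged alternative, assuming the "_txtPageNumber" id suffix stays stable (the same suffix trick the other snippets already use for the Next button):

inputElement = self.driver.find_element_by_css_selector("input[id$='_txtPageNumber']")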

Answer 4 (score: 0)

Create something like a self.page_num:

def parse(self,response):
    self.pages = self.driver.find_element_by_css_selector("#ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_totalRecordsDiv.act_search_footer span")
    self.pages = int(self.pages.text.split('of ')[1].split(']')[0])  # footer text ends like "... of 10411]"

    self.page_num = 1

    while self.page_num <= self.pages:
        for item in response.css("h1.faqsno-heading"):
            name = item.css("div[id^='arrowex']::text").extract_first()
            yield {"Name": name}

        try:
            self.click_nextpage(response.url) #initiate the method to do the clicking
        except TimeoutException:
            break

def click_nextpage(self,link):
    self.driver.get(link)
    elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

    page_link = 'ctl00_SPWebPartManager1_g_d6877ff2_42a8_4804_8802_6d49230dae8a_ctl00_lnkBtn_' + str(self.page_num)
    self.page_num = self.page_num + 1


    self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
    self.wait.until(EC.staleness_of(elem))
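
Note that click_nextpage builds page_link but never uses it; the Next arrow is still what gets clicked. Presumably the intent was to click the numbered pager link directly, roughly like this (a sketch, assuming those lnkBtn_ ids exist on the pager):

    # hypothetical: click the numbered page link built above instead of the Next arrow
    self.wait.until(EC.element_to_be_clickable((By.ID, page_link))).click()
    self.wait.until(EC.staleness_of(elem))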