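Scrapy with Selenium crawls only the first site instead of multiple sites

Time: 2016-03-19 07:42:51

Tags: selenium scrapy

I am scraping JavaScript-generated data from a website, so I am using Scrapy together with Selenium. However, the spider only crawls and scrapes data from the first site. Can anyone help me? The code I wrote is below. Thanks in advance.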

import scrapy
from scrapy.http import Request
import time
from selenium import webdriver

class w01item(scrapy.Item):
    date = scrapy.Field()
    title = scrapy.Field()
    underlying_bid = scrapy.Field()
    bid = scrapy.Field()

class mqSpider(scrapy.Spider):
    name = "w11"
    allowed_domains = ["kswarrants.kasikornsecurities.com"]
    start_urls = ["http://kswarrants.kasikornsecurities.com/www/Tool/calculator"]
    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.add_cookie({'name':'Disc', 'value':'YES','path':'/'})
        self.driver.get("http://kswarrants.kasikornsecurities.com/www/Tool/calculator")
        options=self.driver.find_elements_by_xpath('//select[@id="underling0"]/option')
        for option in options[1:4]:
            a = option.text
            textbox=self.driver.find_element_by_id("calid")
            textbox.send_keys(option.text)
            time.sleep(1)
            self.driver.find_element_by_id("btn_sub").click()
            time.sleep(2)
            for x in xrange(1,3):
                item = w01item()
                item['title']= a
                item['date'] = self.driver.find_element_by_id('d_1').text
                item['underlying_bid']= self.driver.find_element_by_id('d_'+ str(x)+'_1').text
                item['bid'] = self.driver.find_element_by_id('d_'+ str(x)+'_2').text
                yield item
            self.driver.find_element_by_id("calid").clear()

The log from running the script is below.

2016-03-21 23:14:56 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-21 23:14:56 [scrapy] INFO: Optional features available: ssl, http11
2016-03-21 23:14:56 [scrapy] INFO: Overridden settings: {}
2016-03-21 23:14:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-21 23:15:02 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session {"desiredCapabilities": {"platform": "ANY", "br
: "firefox", "version": "", "marionette": false, "javascriptEnabled": true}}
2016-03-21 23:15:02 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultH
leware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-21 23:15:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-21 23:15:02 [scrapy] INFO: Enabled item pipelines:
2016-03-21 23:15:02 [scrapy] INFO: Spider opened
2016-03-21 23:15:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-21 23:15:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-21 23:15:02 [scrapy] DEBUG: Redirecting (302) to <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> from <GET http://kswarrant
securities.com/www/Tool/calculator>
2016-03-21 23:15:02 [scrapy] DEBUG: Crawled (200) <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> (referer: None)
2016-03-21 23:15:03 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/url {"url"
kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a"}
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/cookie {"s
 "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "cookie": {"path": "/", "name": "Disc", "value": "YES"}}
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:04 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/url {"url"
kswarrants.kasikornsecurities.com/www/Tool/calculator", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/elements {
xpath", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "//select[@id=\"underling0\"]/option"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{06
2-46ee-9e2c-720482b405d8}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{06d323c4-f252-46ee-9e2c-720482b405d8}"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "calid"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{06
2-46ee-9e2c-720482b405d8}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{06d323c4-f252-46ee-9e2c-720482b405d8}"}
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:05 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{a
62-4578-896c-9de40ce48162}/value {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{a06dc4f5-0462-4578-896c-9de40ce48162}", "value": ["A", "A", "V",
"C", "1", "6", "0", "4", "A"]}
2016-03-21 23:15:06 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "btn_sub"}
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:07 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{e
ae-4848-9bac-450b5567842b}/click {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{ee24c112-f7ae-4848-9bac-450b5567842b}"}
2016-03-21 23:15:08 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{74
0-45be-80bb-152e3cc78c6a}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{741d6ad8-2640-45be-80bb-152e3cc78c6a}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{d2
1-48bc-a76e-fccc5ad9e646}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{d25a795f-4721-48bc-a76e-fccc5ad9e646}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1_2"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{3d
b-40a0-8880-9f5675bed655}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{3d380c09-248b-40a0-8880-9f5675bed655}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [scrapy] DEBUG: Scraped from <200 http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator>
{'bid': u'0.77',
 'date': u'21/03/2016',
 'title': u'AAV11C1604A',
 'underlying_bid': u'5.10'}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{74
0-45be-80bb-152e3cc78c6a}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{741d6ad8-2640-45be-80bb-152e3cc78c6a}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_2_1"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{60
8-4302-93dc-079c6e686055}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{606f3c95-5e98-4302-93dc-079c6e686055}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "d_2_2"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{8b
0-461f-aafe-0bdafa2c6d6f}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{8be98315-46d0-461f-aafe-0bdafa2c6d6f}"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [scrapy] DEBUG: Scraped from <200 http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator>
{'bid': u'0.80',
 'date': u'21/03/2016',
 'title': u'AAV11C1604A',
 'underlying_bid': u'5.15'}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element {"
d", "sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "value": "calid"}
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:10 [selenium.webdriver.remote.remote_connection] DEBUG: POST http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{9
70-408c-b683-5c363412cf0f}/clear {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{9c1a9c4b-aa70-408c-b683-5c363412cf0f}"}
2016-03-21 23:15:11 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:11 [selenium.webdriver.remote.remote_connection] DEBUG: GET http://127.0.0.1:55653/hub/session/415890c0-fdaa-4c41-80a5-1334d1d5ac8a/element/{00
7-4a5b-86c9-96bed980ebef}/text {"sessionId": "415890c0-fdaa-4c41-80a5-1334d1d5ac8a", "id": "{001d727c-3f87-4a5b-86c9-96bed980ebef}"}
2016-03-21 23:15:12 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2016-03-21 23:15:12 [scrapy] ERROR: Spider error processing <GET http://kswarrants.kasikornsecurities.com/www/Tool/Disc?rurl=calculator> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
    for x in result:
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\python27\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\testing\w11s.py", line 25, in parse
    a = option.text
  File "c:\python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 70, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "c:\python27\lib\site-packages\selenium\webdriver\remote\webelement.py", line 457, in _execute
    return self._parent.execute(command, params)
  File "c:\python27\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 233, in execute
    self.error_handler.check_response(response)
  File "c:\python27\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
StaleElementReferenceException: Message: Element not found in the cache - perhaps the page has changed since it was looked up
Stacktrace:
    at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:9454)
    at Utils.getElementAt (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:9039)
    at WebElement.getElementText (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12092)
    at DelayedCommand.prototype.executeInternal_/h (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12661)
    at DelayedCommand.prototype.executeInternal_ (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12666)
    at DelayedCommand.prototype.execute/< (file:///d:/ssd/tempfi~1/tmpzt51xr/extensions/fxdriver@googlecode.com/components/command-processor.js:12608)
2016-03-21 23:15:12 [scrapy] INFO: Closing spider (finished)

1 Answer:

Answer 0 (score: 0)

If you want to follow links, you should add the corresponding requests to the Scrapy scheduler.

See: http://doc.scrapy.org/en/latest/intro/tutorial.html#following-links
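
A note on the traceback in the question: the spider stops with a StaleElementReferenceException at `a = option.text`, on the second pass of the loop. This suggests that the `<option>` elements located before the loop are no longer valid once the form has been submitted. A minimal sketch of one way around this, assuming the page is re-rendered on submit, is to copy the option texts into plain strings first and to look up every element again after each submit. The helper names `option_texts` and `scrape_rows` are hypothetical. The locator strings "id" and "xpath" are the values of Selenium's `By.ID` and `By.XPATH`, passed to `find_element` / `find_elements` so the sketch does not rely on the `find_element_by_*` helpers, which newer Selenium releases no longer provide.

```python
import time


def option_texts(driver, xpath):
    """Return the text of every element matching xpath as plain strings."""
    # Strings stay usable after the page changes; element handles may not.
    return [element.text for element in driver.find_elements("xpath", xpath)]


def scrape_rows(driver, titles, rows=(1, 2), pause=2):
    """Yield one dict per (title, row), locating elements anew after each submit."""
    for title in titles:
        textbox = driver.find_element("id", "calid")  # fresh lookup on every pass
        textbox.clear()
        textbox.send_keys(title)
        driver.find_element("id", "btn_sub").click()
        # Fixed pause as in the question; an explicit wait would be more robust.
        time.sleep(pause)
        for x in rows:
            yield {
                "title": title,
                "date": driver.find_element("id", "d_1").text,
                "underlying_bid": driver.find_element("id", "d_%d_1" % x).text,
                "bid": driver.find_element("id", "d_%d_2" % x).text,
            }
```

Inside `parse`, this could be used by building `titles = option_texts(self.driver, '//select[@id="underling0"]/option')[1:4]` once, then filling a `w01item` from each dict that `scrape_rows(self.driver, titles)` yields. Whether this resolves the error depends on how the site actually updates the page, which has not been verified here.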