Scrapy reports an ambiguous error when trying to crawl a website

Time: 2021-05-29 16:44:24

Tags: python web-scraping scrapy web-crawler yahoo-finance

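I am building a web spider to scrape Yahoo Finance. I want it to follow the market index links on the home page and read the last closing price from the table on each market index page. This is the output I get: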

2021-05-29 11:39:21 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: scrapybot)
2021-05-29 11:39:21 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.5 (v3.8.5:580fbb018f, Jul 20 2020, 12:11:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 3.0, Platform macOS-10.16-x86_64-i386-64bit
2021-05-29 11:39:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-05-29 11:39:21 [scrapy.crawler] INFO: Overridden settings:
{}
2021-05-29 11:39:21 [scrapy.extensions.telnet] INFO: Telnet Password: 8306af0a852a89a8
2021-05-29 11:39:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']

Here is the code:

import scrapy
from scrapy.crawler import CrawlerProcess

class YahooFinanceSpider(scrapy.Spider):
    name = "Yahoo Stock Scraper"
    button_loc = '//*[@id="marketsummary-itm-0"]/h3/a[1]'
    close_loc = '//*[@id="quote-summary"]/div[1]/table/tbody/tr[1]/td[2]/span/text()'

    def __init__(self, urls):
        self.urls=urls

    def start_requests(self):
        for url in self.urls:
            scrapy.Request(url=url, callback=self.parse_front)

    def parse_front(self, response):
        button = response.xpath(YahooFinanceSpider.button_loc)
        button_link = button.css('a.Fz\(s\).Ell.Fw\(600\).C\(\$linkColor ::attr(href)')
        links_to_follow = button_link.extract()
        for url in links_to_follow:
            yield response.follow(url = url, callback = self.parse_pages)

    def parse_pages(self, response):
        closing_value = response.xpath(YahooFinanceSpider.close_loc).extract()
        for value in closing_value:
            print(value)
            

prices = []

urls=['https://finance.yahoo.com/']

yscraper=YahooFinanceSpider(urls)

process = CrawlerProcess()
process.crawl(YahooFinanceSpider)
process.start()


1 Answer:

Answer 0 (score: 0)

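You are instantiating the object yscraper but never using it. process.crawl(YahooFinanceSpider) receives the class, so Scrapy builds the spider itself and your urls list never reaches __init__.
You should pass the constructor arguments through the call, process.crawl(YahooFinanceSpider, urls=urls), instead of creating the instance yourself: process.crawl expects a spider class (or a Crawler) and forwards any extra arguments to the spider's constructor.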