I'm working on a project that uses Scrapy to crawl Yahoo Finance's most-active stocks (URL: https://finance.yahoo.com/most-active).
There are currently 152 stocks listed there, 25 per page. In most examples I've found online, when someone scrapes a paginated site, the go-to approach is to use an XPath/CSS selector to find the required number of pages in the first page's HTML and loop over them. On my target site, the relevant markup looks like this:
<button class="Va(m) H(20px) Fz(s) Bd(0) M(0) P(0) Bdendc($seperatorColor) O(n):f Bdendw(1px) Bdends(s) Pend(10px) Fw(500) C($linkColor)"><svg class="Va(m)! Fill($linkColor)! Stk($linkColor)! Cur(p)" width="18" height="18" viewBox="0 0 48 48" data-icon="caret-left" style="fill: rgb(0, 0, 0); stroke: rgb(0, 0, 0); stroke-width: 0; vertical-align: bottom;">
<path d="M16.14 24.102L28.865 36.83c.78.78 2.048.78 2.828 0 .78-.78.78-2.047 0-2.828l-9.9-9.9 9.9-9.9c.78-.78.78-2.047 0-2.827-.78-.78-2.047-.78-2.828 0L16.14 24.102z"></path>
</svg><span class="Va(m)">
<span>Prev</span></span>
</button>
<div class="D(ib) Fz(m) Fw(b) Lh(23px) W(75%)--mobp"><span><!-- react-text: 374 -->Matching <!-- /react-text --><span>Stocks</span></span><span class="Mstart(15px) Fw(500) Fz(s)"><span>1-25 of 152 results</span></span>
</div>
https://finance.yahoo.com/most-active?count=25&offset=0
https://finance.yahoo.com/most-active?count=25&offset=25
https://finance.yahoo.com/most-active?count=25&offset=50
https://finance.yahoo.com/most-active?count=25&offset=75
As a last resort, if I can't sort out the pagination, how can I loop over the URLs so that each iteration increases the offset parameter by 25?
Thanks!
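For reference, the fallback the question describes can be sketched without Scrapy at all: generate the offset URLs up front. The page size (25) and total (152) come from the question; the helper name is just for illustration.

```python
# Sketch: build the paginated URLs the question describes.
# 25 per page and 152 results are taken from the question;
# offset_urls is a hypothetical helper name.
BASE = 'https://finance.yahoo.com/most-active'

def offset_urls(total, page_size):
    """Yield one URL per page, stepping the offset by page_size."""
    for offset in range(0, total, page_size):
        yield f'{BASE}?count={page_size}&offset={offset}'

urls = list(offset_urls(152, 25))
# First page: .../most-active?count=25&offset=0
# Last page:  .../most-active?count=25&offset=150  (7 pages in total)
```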
Answer 0 (score: 0)
Just use the count parameter in the link to set how many stocks you want displayed:
https://finance.yahoo.com/most-active?count=152
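As a small sketch, that single-request URL can also be built with the standard library instead of hard-coding the query string (the count value of 152 comes from the answer above):

```python
from urllib.parse import urlencode

# Build the single-page URL suggested above; 152 is the result count
# the site reported at the time of the question.
params = {'count': 152}
url = f"https://finance.yahoo.com/most-active?{urlencode(params)}"
# -> https://finance.yahoo.com/most-active?count=152
```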
Answer 1 (score: 0)
Here is what I got working: from the string "1-25 of 152 results" I split out the 152 and converted it to an integer. Then I divided 152 by 25 and added 1 to get the total number of pages. After that I generated a list of multiples of 25, one per page (7 in this case), looped over it, and issued the requests. The code is below:
def get_stock_count(self, response):
    # The label text reads like "1-25 of 152 results"; take the 152
    count = str(response.xpath('//*[@id="fin-scr-res-table"]/div[1]/div[1]/span[2]/span').css('::text').extract())
    total_results = int(count.split()[-2])
    total_offsets = total_results // 25 + 1  # 152 // 25 + 1 = 7 pages
    offset_list = [i * 25 for i in range(total_offsets)]
    for offset in offset_list:
        yield scrapy.Request(url=f'https://finance.yahoo.com/most-active?count=25&offset={offset}', callback=self.load_pagination)
Add this and scrape the pages in the follow-up callback (load_pagination).
My full spider code:
import scrapy
from ..items import ApartmentsprojectItem  # adjust to your project's items module


class ApartmentSpider(scrapy.Spider):  # imports and class header added; the original snippet omitted them
    name = 'apartmentspider'

    def start_requests(self):
        urls = ['https://finance.yahoo.com/most-active']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.get_stock_count)

    def get_stock_count(self, response):
        # "1-25 of 152 results" -> 152
        count = str(response.xpath('//*[@id="fin-scr-res-table"]/div[1]/div[1]/span[2]/span').css('::text').extract())
        total_results = int(count.split()[-2])
        total_offsets = total_results // 25 + 1
        offset_list = [i * 25 for i in range(total_offsets)]
        for offset in offset_list:
            yield scrapy.Request(url=f'https://finance.yahoo.com/most-active?count=25&offset={offset}', callback=self.load_pagination)

    def load_pagination(self, response):
        # Ticker symbols from the first column of the results table
        stocks = response.xpath('//*[@id="scr-res-table"]/div[1]/table/tbody//tr/td[1]/a').css('::text').extract()
        for stock in stocks:
            yield scrapy.Request(url=f'https://finance.yahoo.com/quote/{stock}?p={stock}', callback=self.parse)

    def parse(self, response):
        items = ApartmentsprojectItem()
        # Only a few fields included for brevity
        items['stock_name'] = response.xpath('//*[@id="quote-header-info"]/div[2]/div[1]/div[1]/h1').css('::text').extract()
        items['intraday_price'] = response.xpath('//*[@id="quote-header-info"]/div[3]/div[1]/div/span[1]').css('::text').extract()
        items['price_change'] = response.xpath('//*[@id="quote-header-info"]/div[3]/div[1]/div/span[2]').css('::text').extract()
        yield items
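Note that the count.split()[-2] trick in get_stock_count relies on the exact wording of the "1-25 of 152 results" label. A slightly more defensive variant (a sketch, not part of the original answer; the helper name is mine) pulls the total out with a regular expression and handles thousands separators:

```python
import re

def total_from_label(label):
    """Extract the final number from a label like '1-25 of 152 results'.

    Returns None if the label does not match the expected pattern.
    """
    m = re.search(r'of\s+([\d,]+)\s+results', label)
    return int(m.group(1).replace(',', '')) if m else None

total = total_from_label('1-25 of 152 results')  # -> 152
```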