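I am trying to extract the "Name", "Latest Price", and "%" fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500
However, no data is scraped, even though I have confirmed in the Chrome console that my XPaths work for these fields.
For reference, I have been following this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/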
items.py
from scrapy.item import Item, Field


class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()
investment_spider.py
from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem


class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
        "https://markets.businessinsider.com/index/components/s&p_500",
    ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')
        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]
            yield item
Console output
...
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)
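Answer 0 (score: 1)
You are missing "./" at the beginning of your XPath expressions. I have also simplified your XPath; note that the simplified row selector no longer includes tbody: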
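def parse(self, response):
    # Select the table rows directly; there is no tbody step in this path.
    stocks = response.xpath('//table[@class="table table-small"]/tr')
    # stocks[1:] skips the first row, which is the table header.
    for stock in stocks[1:]:
        item = InvestmentItem()
        item['name'] = stock.xpath('./td[1]/a/text()').get()
        # strip() removes the whitespace and line breaks around the price.
        item['px'] = stock.xpath('./td[2]/text()[1]').get().strip()
        item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()
        yield item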
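One possible reason an XPath can work in the Chrome console yet match nothing in Scrapy: browsers typically insert a tbody element when they render a table, while the raw HTML that Scrapy downloads may not contain one. The sketch below illustrates this with the standard library's xml.etree.ElementTree instead of Scrapy, so it runs without a network connection. RAW_HTML is a made-up, simplified stand-in for the page, and parse_rows is a hypothetical helper, not part of the original spider.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the raw page source (an assumption):
# the table rows sit directly under <table>, with no <tbody>.
RAW_HTML = """
<div id="index-list-container">
  <table class="table table-small">
    <tr><th>Name</th><th>Latest Price</th></tr>
    <tr><td><a>3M</a></td><td>
146.44</td></tr>
    <tr><td><a>AO Smith</a></td><td>
42.22</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# A path that includes tbody finds nothing in markup that has no tbody.
rows_with_tbody = root.findall(".//table/tbody/tr")

# Dropping tbody matches the rows that are actually in the document.
rows_without_tbody = root.findall(".//table/tr")


def parse_rows(rows):
    """Skip the header row, then read the name and price from each row."""
    parsed = []
    for row in rows[1:]:
        name = row.find("td[1]/a").text
        price = row.find("td[2]").text.strip()
        parsed.append({"name": name, "px": price})
    return parsed


stocks = parse_rows(rows_without_tbody)
print(len(rows_with_tbody), len(rows_without_tbody), stocks)
```

To check this against the real page, compare the row count returned with and without tbody inside scrapy shell, which evaluates selectors against the downloaded source rather than the browser's rendered DOM.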
Answer 1 (score: 1)
XPath version
def parse(self, response):
    rows = response.xpath('//*[@id="index-list-container"]/div[2]/table/tr')
    for row in rows:
        yield {
            'name': row.xpath('td[1]/a/text()').extract(),
            'price': row.xpath('td[2]/text()[1]').extract(),
            'pct': row.xpath('td[5]/span[2]/text()').extract(),
            'datetime': row.xpath('td[7]/span[2]/text()').extract(),
        }
CSS version
def parse(self, response):
    table = response.css('div#index-list-container table.table-small')
    rows = table.css('tr')
    for row in rows:
        name = row.css("a::text").get()
        high_low = row.css('td:nth-child(2)::text').get()
        date_time = row.css('td:nth-child(7) span:nth-child(2) ::text').get()
        yield {
            'name': name,
            'high_low': high_low,
            'date_time': date_time
        }
Result
{"high_low": "\r\n146.44", "name": "3M", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n42.22", "name": "AO Smith", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n91.47", "name": "Abbott Laboratories", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n92.10", "name": "AbbVie", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n193.71", "name": "Accenture", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
{"high_low": "\r\n73.08", "name": "Activision Blizzard", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
{"high_low": "\r\n385.26", "name": "Adobe", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},
{"high_low": "\r\n133.48", "name": "Advance Auto Parts", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
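The high_low values above still carry a leading "\r\n". A minimal sketch of one way to tidy each row before storing it follows. clean_row is a hypothetical helper that is not part of either answer, and stripping commas assumes that larger prices may use a thousands separator.

```python
def clean_row(row):
    """Return a copy of a scraped row with whitespace stripped and the price as a float."""
    cleaned = {
        key: value.strip() if isinstance(value, str) else value
        for key, value in row.items()
    }
    price = cleaned.get("high_low")
    if price:
        # Assumption: a comma, if present, is a thousands separator.
        cleaned["high_low"] = float(price.replace(",", ""))
    return cleaned


# Sample row copied from the result listing above.
sample = {
    "high_low": "\r\n146.44",
    "name": "3M",
    "date_time": "05/26/2020 04:15:11 PM UTC-0400",
}
cleaned_sample = clean_row(sample)
print(cleaned_sample)
```

Rows where a selector matched nothing (for example a header row, where get() returns None) pass through unchanged, so they can be filtered out separately.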