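I am trying to extract the table data from this page. After looking through the browser's network tools, I found an API call that returns the table data I need, so I tried to simulate that request with Python Scrapy. Here are the code and the response message.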
In [27]: url
Out[27]: 'https://www.barchart.com/proxies/core-api/v1/quotes/get?symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symbol,symbolName,weightedAlpha,lastPrice,priceChange,percentChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime,symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1'
In [28]: headers
Out[28]: {'X-XSRF-TOKEN': 'eyJpdiI6Ims2ZVJxT3pRRUplSCtLZXRVZXA3cXc9PSIsInZhbHVlIjoiaDJaQ0hhVWQwUU9zMEQ2S1FqVEVxR3hPYTJYRzd3d0VWWkZzMUhYQmRPSGVoaWVtTnBNUXZzdkJhTngvS2xNLyIsIm1hYyI6Ijc3MzY1N2M4ZDljMWQ4MDY4OTA5ZGQwNmUzYThiNDNkMDNlZDUyZmQ1Mjc4ZTU0MzkwMjA3ZDFmMDAwMTdkYTMifQ=='}
In [29]: fetch(scrapy.Request(url,headers=headers))
2021-03-03 12:12:55 [scrapy.core.engine] DEBUG: Crawled (401) <GET https://www.barchart.com/proxies/core-api/v1/quotes/get?symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symbol,symbolName,weightedAlpha,lastPrice,priceChange,percentChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime,symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1> (referer: None)
Am I missing something in the headers or elsewhere?
Answer 0 (score: 1)
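When you visit https://www.barchart.com/stocks/quotes/MSFT/competitors, you receive response headers with set-cookie=laravel-token... and some other cookies. I tried all of the cookies, and laravel-token is the one used for authentication. You also need the x-xsrf-token, which you are already extracting.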
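To solve this in Scrapy, first make sure cookies are enabled in settings.py. Then send a request to https://www.barchart.com/stocks/quotes/MSFT/competitors. In the parse method for that request, send the next request to the API URL you used above. Scrapy will then handle the cookies automatically.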
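As a minimal sketch of the settings step, assuming a standard Scrapy project layout: `COOKIES_ENABLED` is on by default in Scrapy, so it only needs attention if it was switched off earlier. `COOKIES_DEBUG` is optional and is included here only as a debugging aid.

```python
# settings.py (fragment)

# Keep Scrapy's cookie middleware active so cookies set by the first
# response are sent back with the follow-up API request.
COOKIES_ENABLED = True

# Optional: log the Cookie / Set-Cookie headers of each request and
# response, which helps confirm that the session cookies are forwarded.
COOKIES_DEBUG = True
```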
Here is an example spider that worked for me (I extracted the XSRF token rather sloppily; you may have a better way):
import re
from urllib.parse import unquote

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'

    def start_requests(self):
        # Load the HTML page first so the server sets its session cookies.
        yield scrapy.Request(
            url='https://www.barchart.com/stocks/quotes/MSFT/competitors',
        )

    def parse(self, response):
        # Find the XSRF token among the Set-Cookie headers.
        xsrf_token = None
        for set_cookie in response.headers.getlist('Set-Cookie'):
            match = re.search(r'XSRF-TOKEN=([^;]+)', unquote(set_cookie.decode('utf-8')))
            if match:
                xsrf_token = match.group(1)
        if xsrf_token is None:
            # Without the token the API request would be rejected, so stop here.
            self.logger.error('No XSRF-TOKEN cookie found in the response')
            return
        yield scrapy.Request(
            url='https://www.barchart.com/proxies/core-api/v1/quotes/get?'
                'symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symb'
                'ol,symbolName,weightedAlpha,lastPrice,priceChange,percen'
                'tChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime'
                ',symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&'
                'orderDir=desc&meta=field.shortName,field.type,field.desc'
                'ription&hasOptions=true&page=1&limit=100&raw=1',
            callback=self.parse_data,
            headers={'x-xsrf-token': xsrf_token},
        )

    def parse_data(self, response):
        pass
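Since the token extraction above is admittedly rough, here is one possible alternative that parses the Set-Cookie headers with the standard library instead of a regular expression. The helper name `extract_xsrf_token` is hypothetical, and the sketch assumes the token arrives URL-encoded in a cookie named `XSRF-TOKEN`, as the spider above implies.

```python
from http.cookies import SimpleCookie
from urllib.parse import unquote


def extract_xsrf_token(set_cookie_headers):
    """Return the URL-decoded XSRF-TOKEN cookie value, or None if absent.

    `set_cookie_headers` is an iterable of raw Set-Cookie header values
    (bytes or str), e.g. response.headers.getlist('Set-Cookie') in Scrapy.
    """
    for raw in set_cookie_headers:
        if isinstance(raw, bytes):
            raw = raw.decode('utf-8')
        cookie = SimpleCookie()
        cookie.load(raw)  # parses "name=value; attr=..." into morsels
        morsel = cookie.get('XSRF-TOKEN')
        if morsel is not None:
            # The cookie value is percent-encoded; the header wants it decoded.
            return unquote(morsel.value)
    return None
```

Inside `parse`, this could be called as `extract_xsrf_token(response.headers.getlist('Set-Cookie'))`, with the request skipped when it returns `None`.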
Output
2021-03-03 12:26:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barchart.com/stocks/quotes/MSFT/competitors> (referer: None)
2021-03-03 12:26:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.barchart.com/proxies/core-api/v1/quotes/get?symbol=MSFT&lists=stocks.inSector.all(-COSO)&fields=symbol,symbolName,weightedAlpha,lastPrice,priceChange,percentChange,highPrice1y,lowPrice1y,percentChange1y,tradeTime,symbolCode,symbolType,hasOptions&orderBy=weightedAlpha&orderDir=desc&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1> (referer: https://www.barchart.com/stocks/quotes/MSFT/competitors)
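Once the API request returns 200, `parse_data` still has to turn the body into rows. The sketch below is only an illustration: the helper `rows_from_payload` is hypothetical, and the assumption that the JSON body holds its rows under a top-level `data` key is not confirmed by anything shown above, so inspect the real response before relying on it.

```python
import json


def rows_from_payload(body_text, rows_key='data'):
    """Return the list of row dicts found under `rows_key` in a JSON body.

    ASSUMPTION: the key name 'data' is a guess about the response shape;
    pass a different `rows_key` if the real payload differs.
    """
    payload = json.loads(body_text)
    if not isinstance(payload, dict):
        return []
    rows = payload.get(rows_key, [])
    # Keep only dict-shaped rows so callers can index them by field name.
    return [row for row in rows if isinstance(row, dict)]
```

In the spider, `parse_data` could then do something like `for row in rows_from_payload(response.text): yield row`.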