Need help with web scraping

Time: 2020-05-25 14:20:15

Tags: python-3.x web-scraping beautifulsoup

I was wondering whether anyone could help me with scraping https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L

I am currently using this code to scrape the current price:

currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text

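This works fine most of the time, but occasionally I get an error, and I am not sure why, since the link is correct. When that happens I would like to try getting the price a second way.

Something like: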

try: 
    currentPriceData = soup.find_all('div', {'class':'My(6px) Pos(r) smartphone_Mt(6px)'})[0].find('span').text
except Exception:
    currentPriceData = soup.find('span', {'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})[0].text

The problem is that I cannot scrape the number with this approach. Any help would be appreciated.
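
The fallback idea described above can be sketched independently of any particular page: try each extraction strategy in turn and move on when one fails. Note that `soup.find` returns a single tag rather than a list, so indexing its result with `[0]` is treated as an attribute lookup and raises `KeyError`, which is likely why the `except` branch above fails as well. The helper name `first_result` and the dictionary stand-in for a parsed page are hypothetical, chosen so the snippet runs without a network call:

```python
def first_result(extractors, page):
    """Return the first non-empty value produced by any extractor, else None."""
    for extract in extractors:
        try:
            value = extract(page)
        except (AttributeError, IndexError, KeyError, TypeError):
            # This strategy did not match the page; try the next one.
            continue
        if value:
            return value
    return None


# Hypothetical stand-in for a parsed page: the primary location is missing.
page = {"fallback_price": "227.30"}

extractors = [
    lambda p: p["primary_price"][0],  # raises KeyError on this page
    lambda p: p["fallback_price"],    # succeeds
]

result = first_result(extractors, page)
print(result)
```

With BeautifulSoup, each extractor would instead be a small function that takes the `soup` object and returns the text of one candidate element.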

1 answer:

Answer 0 (score: 0)

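The data is embedded in the page as a JavaScript variable, but you can extract it with a regular expression and parse it with the json module.

For example: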

import re
import json
import requests

url = 'https://finance.yahoo.com/quote/TSCO.l?p=TSCO.L'

html_data = requests.get(url).text

# The next line pulls out of the HTML source the JavaScript variable
# that holds all the data rendered on the page.
# BeautifulSoup cannot run JavaScript, so we use the `json` module
# to parse the data instead.
# NOTE: When you view the page source in Firefox/Chrome, you can search
#       for `root.App.main` to see it.

data = json.loads(re.search(r'root\.App\.main = ({.*?});\n', html_data).group(1))

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# The JavaScript variable is now a standard Python dict,
# so we just print the contents of some keys:

price = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['regularMarketPrice']['fmt']
currency_symbol = data['context']['dispatcher']['stores']['QuoteSummaryStore']['price']['currencySymbol']

print('{} {}'.format(price, currency_symbol))

Prints:

227.30 £
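
If the page layout changes or the request is blocked, `re.search` returns `None` and the one-line extraction above fails with an `AttributeError`. Below is a minimal sketch of a more defensive version, assuming the same `root.App.main` structure as in the answer. The names `SAMPLE_DATA`, `SAMPLE_HTML` and `extract_price` are hypothetical, and the sample string is a heavily trimmed stand-in for the real page source so that the snippet runs without a network call:

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for the data embedded in the page.
SAMPLE_DATA = {
    "context": {
        "dispatcher": {
            "stores": {
                "QuoteSummaryStore": {
                    "price": {
                        "regularMarketPrice": {"raw": 227.3, "fmt": "227.30"},
                        "currencySymbol": "£",
                    }
                }
            }
        }
    }
}
SAMPLE_HTML = "<script>\nroot.App.main = " + json.dumps(SAMPLE_DATA) + ";\n</script>"

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r"root\.App\.main = ({.*?});\n")


def extract_price(html):
    """Return (formatted price, currency symbol), or None if the data is missing."""
    match = PATTERN.search(html)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
        price = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["price"]
        return price["regularMarketPrice"]["fmt"], price["currencySymbol"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or an unexpected structure.
        return None


print(extract_price(SAMPLE_HTML))
print(extract_price("<html></html>"))
```

With a live page you would pass `requests.get(url).text` instead of `SAMPLE_HTML` and treat a `None` result as a signal to retry or fall back to another method.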