Question

我正在尝试使用BeautifulSoup和Python 2.7从Yahoo财务网页上的Risk Statistics表中提取一些数字：https://finance.yahoo.com/quote/SHSAX/risk

到目前为止，我已经使用https://codebeautify.org查看了html：

#!/usr/bin/python
from bs4 import BeautifulSoup, Comment
import urllib

riskURL = "https://finance.yahoo.com/quote/SHSAX/risk"
page = urllib.urlopen(riskURL)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')

我的麻烦实际上是使用汤。查找来获取数字。例如，标准差：

    # std should be 13.44
    stdevValue = float(soup.find("span",{"data-reactid":"124","class":"W(39%) Fl(start)"}).text)
    # std of category should be 0.18
    stdevCat = float(soup.find("span",{"data-reactid":"125","class":"W(57%) Mend(5px) Fl(end)"}).text)

这两个对soup.find的调用均未返回任何内容。我想念什么？

Answer 1

根据我在网上阅读的内容，“ data-reactid”是react框架用于引用组件的自定义属性（您可以在此处what's data-reactid attribute in html?中阅读更多内容），经过几次尝试，我注意到在每个重新加载页面时，数据反应性属性不同，例如随机生成。

我认为您应该尝试找到另一种方法来实现这一目标。

也许您可以尝试查找诸如“标准偏差”行之类的特定元素，然后向下循环以收集数据。

std_span = next(x for x in soup.find_all('span') if x.text == "Standard Deviation")
parent_div = std_span.parent
for sibling in parent_div.next_siblings:
   for child in sibling.children:
      # do something
      print(child.text)

希望有帮助。

Answer 2

from bs4 import BeautifulSoup, Comment
import urllib


riskURL = "https://finance.yahoo.com/quote/SHSAX/risk"
page = urllib.request.urlopen(riskURL)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
#W(25%) Fl(start) Ta(e)
results = soup.find("span", {"data-reactid" : "121"})
print results.text

或者，您可以使用正则表达式和findNext来获取值：

from bs4 import BeautifulSoup, Comment
import urllib


riskURL = "https://finance.yahoo.com/quote/SHSAX/risk"
page = urllib.request.urlopen(riskURL)
content = page.read().decode('utf-8')
soup = BeautifulSoup(content, 'html.parser')
for span in soup.find_all('span',text=re.compile('^(Standard Deviation)')):
    print span.findNext('span').text

使用Beautiful Soup刮除Yahoo Finance的标准差

2 个答案: