Question

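I am trying to scrape data from the income statements on Yahoo Finance using Python. Specifically, let's say I want the most recent figure of Net Income of Apple.

The data is structured in a bunch of nested HTML tables. I am using the requests module to access the page and retrieve the HTML.

I am using BeautifulSoup 4 to sift through the HTML structure, but I can't figure out how to get the figure.

Here is a screenshot of the analysis with Firefox.

My code so far: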

from bs4 import BeautifulSoup
import requests

myurl = "https://finance.yahoo.com/q/is?s=AAPL&annual"
html = requests.get(myurl).content
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

I tried using

all_strong = soup.find_all("strong")

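and then taking the 17th element, which happens to be the one containing the figure I want, but this seems far from elegant. Something like this: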

all_strong[16].parent.next_sibling
...

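Of course, the goal is to use BeautifulSoup to search for the name of the figure I need (in this case "Net Income") and then grab the figures themselves from the same row of the HTML table.

I would really appreciate any ideas on how to solve this, bearing in mind that I would like to apply the solution to retrieve a bunch of other data from other Yahoo Finance pages.

Solution/expansion:

The solution by @wilbur below worked, and I expanded on it to be able to get the values of any figure available on any of the financials pages (i.e. the Income Statement, Balance Sheet, and Cash Flow statement) for any listed company. My function is as follows: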

import re
import sys

def periodic_figure_values(soup, yahoo_figure):

    values = []
    pattern = re.compile(yahoo_figure)

    title = soup.find("strong", text=pattern)    # works for the figures printed in bold
    if title:
        row = title.parent.parent
    else:
        title = soup.find("td", text=pattern)    # works for any other available figure
        if title:
            row = title.parent
        else:
            sys.exit("Invalid figure '" + yahoo_figure + "' passed.")

    cells = row.find_all("td")[1:]    # exclude the <td> with figure name
    for cell in cells:
        if cell.text.strip() != yahoo_figure:    # needed because some figures are indented
            str_value = cell.text.strip().replace(",", "").replace("(", "-").replace(")", "")
            if str_value == "-":
                str_value = "0"    # a lone dash means no value; keep it a string
            value = int(str_value) * 1000
            values.append(value)

    return values

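The yahoo_figure variable is a string. Obviously, it has to be exactly the same as the figure name used on Yahoo Finance. To pass the soup variable, I first use the following function: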

def financials_soup(ticker_symbol, statement="is", quarterly=False):

    if statement == "is" or statement == "bs" or statement == "cf":
        url = "https://finance.yahoo.com/q/" + statement + "?s=" + ticker_symbol
        if not quarterly:
            url += "&annual"
        return BeautifulSoup(requests.get(url).text, "html.parser")

    return sys.exit("Invalid financial statement code '" + statement + "' passed.")

Sample usage - I want to get the income tax expense of Apple Inc. from the last available income statements:

print(periodic_figure_values(financials_soup("AAPL", "is"), "Income Tax Expense"))

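Output: [19121000000, 13973000000, 13118000000]

You could also get the period-end dates from the soup and create a dictionary where the dates are the keys and the figures are the values, but that would make this post too long. So far this seems to work for me, but I am always thankful for constructive criticism.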
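
The dictionary idea mentioned above could look roughly like the sketch below. The helper name figures_by_period is hypothetical, and the period labels are placeholders: the real period-end dates would have to be read from the page's header row, which this post does not show. The values are the ones from the output above.

```python
def figures_by_period(period_ends, values):
    """Pair each period-end label with its value (hypothetical helper)."""
    if len(period_ends) != len(values):
        raise ValueError("expected exactly one value per period-end label")
    return dict(zip(period_ends, values))

# Placeholder labels: the actual dates are assumed to come from the page.
period_ends = ["period 1", "period 2", "period 3"]
values = [19121000000, 13973000000, 13118000000]

by_period = figures_by_period(period_ends, values)
```

The length check is there so that a row with a missing cell fails loudly instead of silently pairing values with the wrong period.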

Answer 1

This is made a little more difficult because "Net Income" is enclosed in a <strong> tag, so bear with me, but I think this works:

import re, requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/q/is?s=AAPL&annual'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pattern = re.compile('Net Income')

title = soup.find('strong', text=pattern)
row = title.parent.parent # yes, yes, I know it's not the prettiest
cells = row.find_all('td')[1:] #exclude the <td> with 'Net Income'

values = [ c.text.strip() for c in cells ]

In this case, values will contain the three table cells in the "Net Income" row (and, I might add, they can easily be converted to ints - I just liked that they kept the ',' as strings):

In [10]: values
Out[10]: [u'53,394,000', u'39,510,000', u'37,037,000']
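
For the conversion to ints, a minimal sketch follows. The helper name to_int is hypothetical; the handling of a lone dash and of parentheses for negatives mirrors the function in the question, and the final multiplication by 1000 assumes, as that function does, that the page reports figures in thousands.

```python
def to_int(cell_text):
    """Convert a cell string such as '53,394,000' or '(1,234)' to an int."""
    cleaned = cell_text.strip().replace(",", "")
    if cleaned == "-":
        # a lone dash is treated as "no value", i.e. zero
        return 0
    if cleaned.startswith("(") and cleaned.endswith(")"):
        # parentheses are treated as a negative number
        cleaned = "-" + cleaned[1:-1]
    return int(cleaned)

values = ['53,394,000', '39,510,000', '37,037,000']
numbers = [to_int(v) * 1000 for v in values]  # assumes figures are in thousands
```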

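When I tested this on Alphabet (GOOG) it didn't work, because I believe they don't display an income statement (https://finance.yahoo.com/q/is?s=GOOG&annual), but when I checked Facebook (FB) the values were returned correctly (https://finance.yahoo.com/q/is?s=FB&annual).

If you want to create a more dynamic script, you can use string formatting to set the URL for whatever ticker symbol you want, like this: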

ticker_symbol = 'AAPL' # or 'FB' or any other ticker symbol
url = 'https://finance.yahoo.com/q/is?s={}&annual'.format(ticker_symbol)
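
Taking the string-formatting idea one step further, the URL construction can be kept in its own small function so it can be checked without any network access. This is only a sketch: statement_url is a hypothetical name, the statement codes are the ones used in the question, and raising ValueError instead of exiting is a design choice made here, not something from the original code.

```python
VALID_STATEMENTS = ("is", "bs", "cf")  # statement codes used in the question

def statement_url(ticker_symbol, statement="is", quarterly=False):
    """Build the financials URL for a ticker (hypothetical helper)."""
    if statement not in VALID_STATEMENTS:
        raise ValueError("Invalid financial statement code: %r" % statement)
    url = "https://finance.yahoo.com/q/{}?s={}".format(statement, ticker_symbol)
    if not quarterly:
        url += "&annual"
    return url

url = statement_url("AAPL")
```

Raising an exception leaves it to the caller to decide whether a bad code should stop the whole script.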

Scraping Yahoo Finance income statements with Python
