Question

I'm trying to scrape data from the Yahoo Finance Statistics page. In this case it is the "5 Year Average Dividend Yield". The data I need is in markup like this:

<tr>
  <td>
    <span>5 Year Average Dividend Yield</span>
  </td>
  <td class="Fz(s) Fw(500) Ta(end)">6.16</td>
</tr>

I'm new to BeautifulSoup and have been trying to read the bs4 docs, but no luck so far. I only just realised that I'm parsing a table. (Yes, I'm a noob.)

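Here is my code so far. It successfully prints out all the rows in the table. I need help isolating the row that contains "5 Year Average Dividend Yield". I only need the numeric value in the next column. Thanks in advance.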

New edit: I've placed version 0.8 below. It gets the "5 Year Average Dividend Yield" value I was looking for.

# Version 0.8 - This worked. It got the value for "5 Year Average Dividend Yield"
# Aim: Find value for "5 Year Average Dividend Yield".

import csv
import time  # only needed if the time.sleep() call below is re-enabled

from bs4 import BeautifulSoup
from selenium import webdriver

file_path = "C:/temp/temp29/"
file_name = "ASX_20180621_lite.txt"
file_path_name = file_path + file_name
print(file_path_name)

# Phase 1 - place all ticker symbols into an array
tickers_phase1_arr = []

with open(file_path_name, "rt") as incsv:
    readcsv = csv.reader(incsv, delimiter=',')
    for row in readcsv:
        if not row:
            continue  # skip blank lines
        # The ticker symbol is in the first column of each row
        ticker_dot_ax = row[0] + ".AX"
        tickers_phase1_arr.append(ticker_dot_ax)
    print(tickers_phase1_arr)


# Phase 2
key_stats_on_stat = ['5 Year Average Dividend Yield']


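# Initialise the browser (PhantomJS support was removed in Selenium 4;
# a headless Chrome or Firefox driver is the usual replacement there)
browser = webdriver.PhantomJS()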

tickers_phase2_arr = []
data = {}

for ticker_phase2 in tickers_phase1_arr:
    print(ticker_phase2)
    #time.sleep(5)
    # Build the Key Statistics URL for this ticker
    url = "https://finance.yahoo.com/quote/{0}/key-statistics?p={0}".format(ticker_phase2)
    # Load the page in the headless browser
    browser.get(url)
    # Run a script that gets all the html in the webpage that the browser got from the get request
    innerHTML = browser.execute_script("return document.body.innerHTML")
    #Turn innerHTML into a BeautifulSoup object to make the components easier to access for scraping
    soup = BeautifulSoup(innerHTML, 'html.parser')
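    # Find the value for each requested statistic
    for stat in key_stats_on_stat:
        page_stat = soup.find(text=stat)
        try:
            page_row = page_stat.find_parent('tr')
            try:
                # Prefer the second span in the row, if there is one
                page_statnum = page_row.find_all('span')[1].contents[0]
            except IndexError:
                # Otherwise fall back to the second td in the row
                page_statnum = page_row.find_all('td')[1].contents[0]
        except (AttributeError, IndexError):
            # The label was not found, or the row has no value cell
            print('Invalid parent for this element')
            page_statnum = "N/A"
        print(page_statnum)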

Answer 1

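There are several ways to get to the td element that contains the desired value. One is to locate the label text inside the span in the first column, then use find_next() to find the next td element: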

tr.find(text='5 Year Average Dividend Yield').find_next('td').get_text()

where tr represents the current row.
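
To try the find_next() lookup without loading the live page, here is a minimal sketch that runs against the row from the question. The wrapping table element and the variable names are assumptions for illustration, and it searches the whole soup rather than a single tr. It uses string=, which is the newer name for the text= argument in Beautiful Soup 4.4 and later.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; the <table> wrapper is an assumption.
html = """
<table>
  <tr>
    <td>
      <span>5 Year Average Dividend Yield</span>
    </td>
    <td class="Fz(s) Fw(500) Ta(end)">6.16</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find(string=...) returns the matching text node, or None if the label is absent.
label = soup.find(string="5 Year Average Dividend Yield")

# From the text node, find_next('td') moves forward to the value cell.
value = label.find_next("td").get_text(strip=True) if label else None
print(value)
```

Checking for None before calling find_next() avoids an AttributeError on tickers whose page does not list this statistic.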

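Another approach may work better. If you need to do this kind of lookup often, you can build a dictionary that uses the text of the first-column cell as the key and the text of the second-column cell as the value: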

data = {}
for tr in soup.find('table').find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 2:
        continue  # skip header or spacer rows that lack two td cells

    first_cell, second_cell = cells[:2]
    data[first_cell.get_text(strip=True)] = second_cell.get_text(strip=True)

You can then look up a value in data by its first-column text:

print(data['5 Year Average Dividend Yield'])
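
If the statistics are spread over more than one table, soup.find('table') only sees the first of them. The sketch below is one possible variation, not the only way to do it: it walks every tr in the document and uses dict.get() so a missing statistic yields "N/A" instead of a KeyError. The sample markup, the "Payout Ratio" row with its value, and the rows_to_dict name are invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup: two small tables, one header row, made-up values.
SAMPLE_HTML = """
<table>
  <tr><th>Dividends</th><th>Value</th></tr>
  <tr>
    <td><span>5 Year Average Dividend Yield</span></td>
    <td class="Fz(s) Fw(500) Ta(end)">6.16</td>
  </tr>
</table>
<table>
  <tr>
    <td><span>Payout Ratio</span></td>
    <td class="Fz(s) Fw(500) Ta(end)">N/A</td>
  </tr>
</table>
"""


def rows_to_dict(soup):
    """Map first-column text to second-column text for every two-cell row."""
    stats = {}
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # header rows (th only) or spacer rows
        stats[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
    return stats


stats = rows_to_dict(BeautifulSoup(SAMPLE_HTML, "html.parser"))

# .get() returns the fallback when the page does not list the statistic.
print(stats.get("5 Year Average Dividend Yield", "N/A"))
print(stats.get("Some Missing Statistic", "N/A"))
```

The values stay as strings here; converting them with float() would need its own handling for entries such as "N/A".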

Searching the Yahoo Finance Statistics page with BeautifulSoup
