I'm new to web scraping, and I'm trying to scrape the "Statistics" page on Yahoo Finance for AAPL. Here is the link: https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL
Here is the code I have so far...
from bs4 import BeautifulSoup
from requests import get
url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")
for stock in stock_data:
    print(stock.text)
When I run that, I get back all of the table data on the page. However, I only want specific pieces of data from each table (for example "Market Cap", "Revenue", "Beta").
I tried messing with the code by doing print(stock[1].text) to see if I could narrow the output down to just the second value in each table, but that returned an error message. Am I on the right track with BeautifulSoup, or do I need a completely different library? What can I do to return only the specific data I want rather than all of the table data on the page?
Answer 0 (score: 2)
Inspecting the HTML code gives you the best sense of how BeautifulSoup will handle what it sees.
The web page appears to contain several tables, and those tables in turn contain the information you are looking for. The tables follow a consistent structure.
First scrape all of the tables on the web page, then find all of the table rows (the <tr> tags) inside each table, and finally the table data cells (the <td> tags) that those rows contain.
Here is one way to do it. I have even included a function that prints only a specific measurement.
from bs4 import BeautifulSoup
from requests import get

url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")

# stock_data will contain multiple tables, next we examine each table one by one
for table in stock_data:
    # Scrape all table rows into variable trs
    trs = table.find_all('tr')
    for tr in trs:
        # Scrape all table data tags into variable tds
        tds = tr.find_all('td')
        # Index 0 of tds will contain the measurement
        print("Measure: {}".format(tds[0].get_text()))
        # Index 1 of tds will contain the value
        print("Value: {}".format(tds[1].get_text()))
        print("")

def get_measurement(table_array, measurement):
    for table in table_array:
        trs = table.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            if measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()

# print only one measurement, e.g. operating cash flow
print(get_measurement(stock_data, "operating cash flow"))
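A side note beyond the original answer: some rows on that page may contain fewer than two <td> cells, which would make tds[0] or tds[1] raise an IndexError, and Yahoo sometimes rejects requests that lack a browser-like User-Agent header. A more defensive sketch of the same lookup (the header value is a hypothetical placeholder) could look like this:

from bs4 import BeautifulSoup
from requests import get

url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
# Hypothetical browser-like User-Agent, in case Yahoo rejects bare requests
headers = {'User-Agent': 'Mozilla/5.0'}
response = get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
stock_data = soup.find_all("table")

def get_measurement_safe(table_array, measurement):
    for table in table_array:
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            # Skip rows that do not have both a label cell and a value cell
            if len(tds) < 2:
                continue
            if measurement.lower() in tds[0].get_text().lower():
                return tds[1].get_text()
    return None  # measurement not found in any table

print(get_measurement_safe(stock_data, "Market Cap"))
print(get_measurement_safe(stock_data, "Beta"))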
Answer 1 (score: 0)
Although this isn't Yahoo Finance, you can do something similar...
import requests
from bs4 import BeautifulSoup

base_url = 'https://finviz.com/screener.ashx?v=152&o=price&t=MSFT,AAPL,SBUX,S,GOOG&o=price&c=0,1,2,3,4,5,6,7,8,9,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs={'id': 'screener-content'})

light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")

data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))

import pandas
pandas.DataFrame(data).to_csv("C:\\your_path\\AAA.csv", header=False)
If Yahoo decides to strip more functionality from their API, this is a great substitute. I know they cut out a lot of things a few years ago (mostly historical quotes). It was sad to see that go away.
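As a further aside, not part of the original answers: pandas can also parse HTML tables directly with pandas.read_html, which returns one DataFrame per <table> element it finds. A minimal sketch, assuming the tables are present in the static HTML and that a parser such as lxml or html5lib is installed:

import pandas as pd
from requests import get

url = 'https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL'
# Hypothetical browser-like User-Agent, in case the site rejects bare requests
response = get(url, headers={'User-Agent': 'Mozilla/5.0'})

# read_html returns a list of DataFrames, one per <table> in the HTML
tables = pd.read_html(response.text)
for df in tables:
    print(df.head())

Each DataFrame then holds the measurement names in one column and their values in the next, so the same kind of lookup can be done with ordinary DataFrame filtering.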