Using BeautifulSoup4

Date: 2016-02-23 20:01:57

Tags: html parsing web-scraping beautifulsoup

I am trying to scrape the "Market Cap" data from a Yahoo Finance summary page.

The HTML shown in Chrome's inspect tool looks like this: [image]

My code is:

from urllib.request import urlopen
from bs4 import BeautifulSoup

sp500short = ['a', 'aa', 'aapl', 'abbv', 'abc', 'abt', 'aci', 'acn', 'act', 'adbe', 'adi', 'adm', 'adp']
dowJones = ['mmm', 'axp', 'aapl', 'ba', 'cat', 'cvx', 'csco', 'ko', 'dd', 'xom', 'ge', 'gs', 'hd', 'intc', 'ibm', 'jpm', 'jnj', 'mcd', 'mrk', 'msft', 'nke', 'pfe', 'pg', 'trv', 'utx', 'unh', 'vz', 'v', 'wmt', 'dis']


def stockScreener():

    for ticker in sp500short:
        searchSummary = "http://finance.yahoo.com/q?s="+ticker
        summary = urlopen(searchSummary)
        summaryHtml = summary.read()
        summarySoup = BeautifulSoup(summaryHtml, "html.parser")

        try:
            marketCap = summarySoup.find("th scope", text="Market Cap:").find_next_sibling("td").text

        except:
            marketCap = "There is no data for this company" 

        if marketCap == "There is no data for this company":
            print(ticker+" "+marketCap)            
        else:
            output = marketCap[:-1]
            print(ticker + str(output))

stockScreener()

What is wrong with my .find() call?

1 answer:

Answer 0 (score: 1)

You are very close. You just need to remove scope from this line:

marketCap = summarySoup.find("th scope", text="Market Cap:").find_next_sibling("td").text

It should look like this:

marketCap = summarySoup.find("th", text="Market Cap:").find_next_sibling("td").text

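scope is an attribute of the <th> tag you are searching for, not part of the tag name itself. The first argument to .find() is matched against the tag name only, so "th scope" matches nothing and .find() returns None.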
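
To illustrate the fix, here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The snippet, its values, the scope="row" attribute value, and the helper name get_market_cap are all assumptions made for illustration; the real Yahoo Finance markup may differ. It also uses string=, which newer BeautifulSoup releases accept in place of text=. If you do want to filter on the scope attribute, it can be passed through attrs rather than in the tag name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's description;
# the actual page structure and values may differ.
SAMPLE_HTML = """
<table>
  <tr><th scope="row">Prev Close:</th><td>96.88</td></tr>
  <tr><th scope="row">Market Cap:</th><td>524.97B</td></tr>
</table>
"""


def get_market_cap(html):
    """Return the text of the <td> next to the 'Market Cap:' header, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag name goes first; attributes such as scope go in attrs.
    header = soup.find("th", attrs={"scope": "row"}, string="Market Cap:")
    if header is None:
        return None
    cell = header.find_next_sibling("td")
    return cell.get_text(strip=True) if cell is not None else None


# The original call looks for a tag literally named "th scope", which never exists.
broken = BeautifulSoup(SAMPLE_HTML, "html.parser").find("th scope", string="Market Cap:")
print(broken)                       # no match for the bogus tag name
print(get_market_cap(SAMPLE_HTML))  # text of the neighbouring <td>
```

Because the original .find() returns None, the chained .find_next_sibling() call raises an AttributeError, which the bare except in the question silently turns into the "no data" message for every ticker.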