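BeautifulSoup scraping: text of expandable row titles is missing

Time: 2020-08-11 02:22:25

Tags: python beautifulsoup

I am trying to use BeautifulSoup to extract data from the Y! Finance website and store everything in a list. In the list, the titles of the expandable rows (Total Revenue, Operating Expense) are missing, but the numbers are still there. Is there a way to include the titles in the output?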

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url = 'https://finance.yahoo.com/quote/AAPL/financials?p=AAPL'

read_data = ur.urlopen(url).read()
soup = BeautifulSoup(read_data, 'lxml')

ls = []  # Create empty list
for l in soup.find_all('div'):
    ls.append(l.string)  # Collect the .string of every div (may be None)

new_ls = list(filter(None, ls))  # Drop None and empty entries

Current output:

 'Expand All',
 'ttm',
 '9/30/2019',
 '9/30/2018',
 '9/30/2017',
 '9/30/2016',
 '273,857,000',
 '260,174,000',
 '265,595,000',
 '229,234,000',
 '215,639,000',

Expected output:

 'Expand All',
 'ttm',
 '9/30/2019',
 '9/30/2018',
 '9/30/2017',
 '9/30/2016',
 'Total Revenue',
 '273,857,000',
 '260,174,000',
 '265,595,000',
 '229,234,000',
 '215,639,000',

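Update: if I extract from 'span' instead, the figures that are 0 are missing from the output, which creates another problem later when building the DataFrame.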

for l in soup.select(r'div.D\(tbr\)'):  # raw string keeps the CSS escapes intact
    for n in l.select('span'):
        print(n.text)

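2 answers:

Answer 0: (score: 2)

I know this is slightly off topic, but it looks like you just want the data from Yahoo Finance, correct? If so, there is already a Python package for it, and using it may be easier than scraping the web page.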

https://pypi.org/project/yahoo-finance/

You can import Share:

from yahoo_finance import Share

You can also fetch a lot of data with the following command:

apple = Share('AAPL')

Answer 1: (score: 0)

The following will give you all the data; you can then filter out what you do not need:

for row in soup.select('div[data-test="fin-row"]'):     
    for r in row:
        for l in r:
            print(l.text)
    print('-------\n')

Output:

Total Revenue
273,857,000
260,174,000
265,595,000
-
215,639,000
-------

Cost of Revenue
169,277,000
161,782,000
163,756,000
-
131,376,000
-------

Gross Profit
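
For context on why the titles went missing in the question: `Tag.string` returns None whenever a tag has more than one child, which is probably the case for the title cell of an expandable row (it holds an expand button next to the label). `get_text()` has no such restriction. A minimal sketch on a hand-written HTML fragment; the markup and class names below are assumptions that only loosely imitate the Yahoo Finance page, not a copy of it:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the title cell holds a button and a span,
# the value cells hold a single span or a bare "-".
html = """
<div data-test="fin-row">
  <div class="title"><button></button><span>Total Revenue</span></div>
  <div class="cell"><span>273,857,000</span></div>
  <div class="cell">-</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one('div[data-test="fin-row"]')

# Only the direct child divs of the row, i.e. one div per cell.
title, first, second = row.find_all("div", recursive=False)

print(title.string)                 # None: the div has two children
print(title.get_text(strip=True))   # Total Revenue
print(first.string)                 # 273,857,000 (single child chain)
print(second.get_text(strip=True))  # -

# One string per cell, title included.
cells = [d.get_text(strip=True) for d in row.find_all("div", recursive=False)]
print(cells)
```

Reading each cell with `get_text()` keeps the title and the `-` placeholders in position, so every row ends up with the same number of entries.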

If you also want to get the column headers programmatically, try:

head_ind = [55,58,60,62,64,66]
for i in head_ind:
    heads = f'span[data-reactid="{i}"]:not([class])'
    for head in soup.select(heads):
        print(head.text)

Output:

Breakdown
ttm
9/30/2019
9/30/2018
9/30/2017
9/30/2016
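
To go from the printed strings to something that can feed a DataFrame, the header labels and the cells of each row can be zipped together, with `-` treated as a missing value. A rough sketch using the values printed above as literal input; `parse_amount` and `to_record` are hypothetical helper names, mapping `-` to None is an assumption about what the dash means, and the conversion assumes whole-number amounts:

```python
def parse_amount(text):
    """Convert '273,857,000' to an int; return None for '-' or empty cells."""
    text = text.strip()
    if text in ("", "-"):
        return None
    return int(text.replace(",", ""))


def to_record(headers, cells):
    """Pair the row title and the remaining cells with the header labels."""
    record = {headers[0]: cells[0]}
    for name, value in zip(headers[1:], cells[1:]):
        record[name] = parse_amount(value)
    return record


# Values copied from the outputs shown in this answer.
headers = ["Breakdown", "ttm", "9/30/2019", "9/30/2018", "9/30/2017", "9/30/2016"]
cells = ["Total Revenue", "273,857,000", "260,174,000", "265,595,000", "-", "215,639,000"]

record = to_record(headers, cells)
print(record)
```

A list of such records, one per row, can then be passed to `pd.DataFrame(...)`.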