Question

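I am trying to scrape Morningstar financials with BeautifulSoup. For some reason, I can't even find the table that contains the financial data.

I have tried searching for both div tags and table tags, with no luck either way.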

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

urls= [
'http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US'
      ]

for url in urls:

  try:
      html = uReq(url)
      page_soup = soup(html, "html.parser")

      mainPage = (page_soup.find("table",{"class":"r_table1 text2"}))

      print (mainPage)

  except:
      pass

When I search the whole page, it doesn't return any tables. The data table I want to scrape should be under the "Financials" div tag.

Answer 1

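The data is loaded through AJAX, so it is not in the HTML that BeautifulSoup receives from the page URL (check the browser's developer console to find the right URLs). Even then, the responses are in JSONP format, so some more preprocessing is needed: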

The code below prints the data in the tables:

from bs4 import BeautifulSoup
import requests
import re
import json

# AJAX endpoints taken from the browser's developer console.
# callback=xxx sets the name of the JSONP wrapper function in the response.
url1 = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=xxx&t=AAPL'
url2 = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?&callback=xxx&t=AAPL'

def get_soup(url):
    # Strip the xxx(...) JSONP wrapper, decode the JSON inside it,
    # and parse the HTML fragment stored under 'componentData'.
    text = requests.get(url).text
    data = json.loads(re.findall(r'xxx\((.*)\)', text)[0])
    return BeautifulSoup(data['componentData'], 'lxml')

soup1 = get_soup(url1)
soup2 = get_soup(url2)

def print_table(soup):
    for tr in soup.select('tr'):
        # Keep only the non-empty cells of this row.
        row_data = [td.text for td in tr.select('td, th') if td.text]
        if not row_data:
            continue
        # Rows with fewer than 12 cells lack a label cell; pad them so the columns line up.
        if len(row_data) < 12:
            row_data = ['X'] + row_data
        for j, td in enumerate(row_data):
            if j == 0:
                # First column: right-aligned row label.
                print('{: >30}'.format(td), end='|')
            else:
                # Remaining columns: centered values.
                print('{: ^12}'.format(td), end='|')
        print()

print_table(soup1)
print()
print_table(soup2)
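
To make the JSONP step easier to follow, here is a small sketch that can be run without any network access. It is only an illustration under a few assumptions: `unwrap_jsonp` is a hypothetical helper name, and the sample string is made up to resemble the shape described above (a callback wrapping a JSON object with a `componentData` HTML string). It is not a real Morningstar response.

```python
import json
import re


def unwrap_jsonp(text, callback="xxx"):
    """Return the JSON value wrapped in a JSONP call such as xxx({...})."""
    # Match: callback name, opening parenthesis, the JSON body,
    # closing parenthesis, and an optional trailing semicolon.
    pattern = r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("response is not JSONP for callback %r" % callback)
    return json.loads(match.group(1))


# Made-up payload for illustration only (assumed shape, not real data).
sample = 'xxx({"componentData": "<table><tr><th>Revenue</th><td>100</td></tr></table>"})'

payload = unwrap_jsonp(sample)
html_fragment = payload["componentData"]
print(html_fragment)
```

The resulting `html_fragment` is an ordinary HTML string, so it can be handed to BeautifulSoup in the same way as in the answer's code. Anchoring the pattern to the whole response and raising `ValueError` on a mismatch is a design choice of this sketch: it makes a changed or non-JSONP response fail loudly instead of producing an `IndexError` later.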

Can't find table using Beautiful Soup on Morningstar