Python web scraping - NoneObject failure - broken HTML?

Date: 2015-12-16 15:10:49

Tags: python html web-scraping beautifulsoup

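I have a problem with my parsing script in Python. I've already tried it on another page (Yahoo Finance) and it worked fine. On Morningstar, however, it does not work. I get a "NoneObject" error for the table variable in the terminal. I guess this has to do with the structure of the Morningstar site, but I'm not sure. Maybe someone can tell me what went wrong. Or is this simply not possible with my simple script because of the way the Morningstar site is structured?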

A direct CSV export from Morningstar is not a solution, because I want to use the script for other sites that don't offer this feature.

import requests
import csv
from bs4 import BeautifulSoup
from lxml import html

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'r_table1 text2'})

print table.prettify() #debugging

list_of_rows = []
for row in table.findAll('tr'):
   list_of_cells =[]

   for cell in row.findAll(['th','td']):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
   list_of_rows.append(list_of_cells)
print list_of_rows #debugging

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)

1 answer:

Answer 0 (score: 1)

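The table is loaded dynamically through a separate XHR call to an endpoint that returns a JSONP response. Because requests only fetches the initial HTML and does not run JavaScript, soup.find() finds no matching table on the original page and returns None. Simulate the XHR request, extract the JSON string from the JSONP response, load it with json, extract the HTML under the componentData key, and load it with BeautifulSoup: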

import json
import re

import requests
from bs4 import BeautifulSoup

# make a request
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)

# extract the HTML under the "componentData"
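# use response.text (str) rather than response.content (bytes) so the str regex also works on Python 3
data = json.loads(re.sub(r'([a-zA-Z_0-9\.]*\()|(\);?$)', '', response.text))["componentData"]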

# parse HTML
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
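
The JSONP unwrapping step can also be isolated in a small helper. The following is only a sketch under assumptions: the name unwrap_jsonp and the sample body are made up for illustration and are not taken from Morningstar, and the anchored pattern is a swapped-in alternative to the re.sub call above. It strips only the outermost callback wrapper, so parentheses inside the payload are left alone.

```python
import json
import re

# Anchored pattern: optional callback name, "(", the payload, ")" and an
# optional trailing ";". Only the outermost wrapper is removed.
_JSONP_RE = re.compile(r'^\s*[\w.$]*\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Return the decoded JSON payload of a JSONP response body (hypothetical helper)."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical sample body, shaped like the response described above.
sample = (
    'jsonp1450279445504({"componentData": '
    '"<table class=\\"r_table1 text2\\"><tr><td>f(x)</td></tr></table>"});'
)

payload = unwrap_jsonp(sample)
component_html = payload["componentData"]
print(component_html)
```

In the script above this would presumably be called as unwrap_jsonp(response.text)["componentData"]. Since soup.find() returns None when nothing matches, it is also worth checking `table is None` before calling table.prettify(), so that a changed page layout produces a clear message instead of an AttributeError.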