Question

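My Python parsing script has a problem. I've already tried it on another page (Yahoo Finance), where it works fine. On Morningstar, however, it does not work: in the terminal I get a "NoneObject" error for the table variable. I guess this has to do with the structure of the Morningstar site, but I'm not sure. Maybe someone can tell me what is going wrong. Or is this simply not possible with my simple script because of how the Morningstar site is structured?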

A plain CSV export directly from Morningstar is not a solution, because I want to use the script for other sites that do not offer that feature.

import requests
import csv
from bs4 import BeautifulSoup
from lxml import html

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'r_table1 text2'})

print table.prettify() #debugging

list_of_rows = []
for row in table.findAll('tr'):
   list_of_cells =[]

   for cell in row.findAll(['th','td']):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
   list_of_rows.append(list_of_cells)
print list_of_rows #debugging

outfile = open("./test.csv", "wb")
writer = csv.writer(outfile)
writer.writerows(list_of_rows)

Answer 1

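The table is loaded dynamically: the page makes a separate XHR call to an endpoint that returns a JSONP response, so the table is not in the HTML you download and soup.find() returns None. Simulate that request, extract the JSON string from the JSONP response, load it with json, take the HTML from the componentData key, and parse it with BeautifulSoup: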

import json
import re

import requests
from bs4 import BeautifulSoup

# make a request
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)

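# strip the JSONP wrapper (callback name, parentheses, optional semicolon)
# and extract the HTML under the "componentData" key; response.text is a
# str, which works with a str regex pattern on both Python 2 and Python 3
data = json.loads(re.sub(r'^\s*[\w.]*\(|\);?\s*$', '', response.text))["componentData"]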

# parse HTML
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
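
The JSONP-unwrapping step can also be isolated into a small helper, which makes it easier to test without hitting the network. This is only a minimal sketch: the function name unwrap_jsonp and the sample payload are made up for illustration, and the real response may contain more keys than componentData.

```python
import json
import re

# Matches `callbackName( ... )` with an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(payload):
    """Return the decoded JSON object wrapped inside a JSONP payload.

    Hypothetical helper, not part of requests or BeautifulSoup.
    """
    match = _JSONP_RE.match(payload)
    if match is None:
        raise ValueError("payload does not look like JSONP")
    return json.loads(match.group(1))


# Made-up response body, shaped like the one described in this answer.
sample = (
    'jsonp1450279445504({"componentData": '
    '"<table class=\\"r_table1 text2\\"><tr><th>Revenue</th></tr></table>"})'
)

component_html = unwrap_jsonp(sample)["componentData"]
print(component_html)
```

With a live response you would pass response.text to unwrap_jsonp and hand the resulting componentData string to BeautifulSoup as above; the row loop and csv.writer from the question should then work unchanged on the table that is found.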

Python webscraping - NoneObject Failure - broken HTML?
