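I am trying to read the EC2 pricing tables with pandas. Based on the documentation I expected a list of DataFrames, one per table on the page, but the list I get back contains only a single table.
Code example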
import pandas
link = 'http://aws.amazon.com/ec2/pricing/'
data = pandas.read_html(link)
print type(data)
print data[0]
Output
<type 'list'>
0 1 2
0 Reserved Instance Volume Discounts NaN NaN
1 Total Reserved Instances Upfront Discount Hourly Discount
2 Less than $250,000 0% 0%
3 $250,000 to $2,000,000 5% 5%
4 $2,000,000 to $5,000,000 10% 10%
5 More than $5,000,000 Contact Us Contact Us
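Environment:
Answer 0 (score: 1)
http://aws.amazon.com/ec2/pricing/ uses JavaScript to fill in the data in its tables.
Unlike what you see when you point a GUI browser at the link, the data is missing if you download the HTML with urllib2: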
import urllib2
link = 'http://aws.amazon.com/ec2/pricing/'
response = urllib2.urlopen(link)
# Raw HTML as served, before any JavaScript has run
content = response.read()
(Then search the content for <table> tags.)
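One way to make that check concrete is to count the `<table>` elements and the cells that actually contain text. The following is only a minimal sketch using the standard library's `html.parser`. It is written for Python 3 (in Python 2 the module is named `HTMLParser`), and the `TableCounter` class, the `summarize` helper and the `SAMPLE` string are made up for illustration, not taken from the AWS page. In practice you would pass in the downloaded `content` instead of `SAMPLE`.

```python
from html.parser import HTMLParser  # Python 3 module name


class TableCounter(HTMLParser):
    """Count <table> elements and the <td>/<th> cells that contain text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tables = 0
        self.filled_cells = 0
        self._in_cell = False
        self._cell_has_text = False

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables += 1
        elif tag in ('td', 'th'):
            self._in_cell = True
            self._cell_has_text = False

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._in_cell = False

    def handle_data(self, data):
        # Count a cell once, the first time it yields non-whitespace text.
        if self._in_cell and data.strip() and not self._cell_has_text:
            self._cell_has_text = True
            self.filled_cells += 1


def summarize(html):
    """Return (number of tables, number of non-empty cells)."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables, parser.filled_cells


# Hypothetical sample: one filled table and one empty shell, similar to
# what a page looks like before JavaScript populates it.
SAMPLE = """
<table><tr><th>Total Reserved Instances</th><th>Upfront Discount</th></tr>
<tr><td>Less than $250,000</td><td>0%</td></tr></table>
<table><tr><td></td><td></td></tr></table>
"""

print(summarize(SAMPLE))
```

For the sample this reports two tables but only four filled cells. Many tables with few filled cells would suggest that the data is being filled in later by JavaScript.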
To process the JavaScript, you need an automated browser engine such as Selenium, or WebKit or SpiderMonkey.
Here is a solution using Selenium:
import selenium.webdriver as webdriver
import contextlib
import pandas as pd

@contextlib.contextmanager
def quitting(thing):
    # Make sure the browser is shut down even if an error occurs
    try:
        yield thing
    finally:
        thing.quit()

with quitting(webdriver.Firefox()) as driver:
    link = 'http://aws.amazon.com/ec2/pricing/'
    driver.get(link)
    # page_source is the HTML as rendered by the browser
    content = driver.page_source

# Save a copy of the rendered HTML for inspection
with open('/tmp/out.html', 'wb') as f:
    f.write(content.encode('utf-8'))

data = pd.read_html(content)
print len(data)
yields
238
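With that many tables, it helps to narrow the list down to the ones that mention a particular term. The following is a minimal sketch, assuming `data` is the list returned by `pd.read_html`. Here `data` is replaced by two small hand-built DataFrames so that the example runs without a browser, and `tables_containing` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def tables_containing(tables, text):
    """Return the DataFrames in which at least one cell contains `text`."""
    matches = []
    for df in tables:
        # Compare on the string form so NaN and numeric cells do not raise.
        cells = df.astype(str)
        hit = any(
            column.str.contains(text, regex=False).any()
            for _, column in cells.items()
        )
        if hit:
            matches.append(df)
    return matches


# Stand-ins for the list returned by pd.read_html(content); made-up values.
data = [
    pd.DataFrame([['Total Reserved Instances', 'Upfront Discount'],
                  ['Less than $250,000', '0%']]),
    pd.DataFrame([['Column A', 'Column B'],
                  ['x', 'y']]),
]

found = tables_containing(data, 'Upfront Discount')
print(len(found))
```

`pd.read_html` also accepts a `match` argument that restricts the result to tables containing matching text, which may be enough if you know a distinctive string in advance.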