pandas.read_html returns only one table

Asked: 2014-11-24 20:54:12

Tags: python html amazon-web-services pandas

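I am trying to read the EC2 pricing tables with pandas. Based on the documentation I expected a list of DataFrames, one per table on the page, but I get a list that contains only a single table.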

Code example

import pandas
link = 'http://aws.amazon.com/ec2/pricing/' 
data = pandas.read_html(link)
print type(data)
print data[0]

Output

<type 'list'>
                               0                 1                2
0  Reserved Instance Volume Discounts               NaN              NaN
1            Total Reserved Instances  Upfront Discount  Hourly Discount
2                  Less than $250,000                0%               0%
3              $250,000 to $2,000,000                5%               5%
4            $2,000,000 to $5,000,000               10%              10%
5                More than $5,000,000        Contact Us       Contact Us

Environment:

  • Ubuntu 14.10
  • python 2.7.8
  • pandas 0.14.1

1 answer:

Answer 0 (score: 1)

http://aws.amazon.com/ec2/pricing/ uses JavaScript to fill in the data in its tables.

Unlike what you see when you point a GUI browser at the link, the data is missing if you download the HTML with urllib2:

import urllib2
link = 'http://aws.amazon.com/ec2/pricing/'
response = urllib2.urlopen(link)
content = response.read()  # raw HTML, before any JavaScript has run

(Then search content for <table> tags.)
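As a rough illustration of that search, here is a minimal sketch that counts the opening <table> and <tr> tags in an HTML string with a regular expression. The helper name count_tags and the sample markup are made up for the example; in practice you would pass in the content downloaded above. A regex count is only a quick sanity check, not a real HTML parser.

```python
import re

# Match an opening tag, with or without attributes, in any letter case.
TABLE_TAG = re.compile(r'<table\b', re.IGNORECASE)
ROW_TAG = re.compile(r'<tr\b', re.IGNORECASE)

def count_tags(html):
    """Return (number of <table> tags, number of <tr> tags) found in html."""
    return len(TABLE_TAG.findall(html)), len(ROW_TAG.findall(html))

# Hypothetical stand-in for the downloaded page: one table with static rows
# and one empty table shell of the kind JavaScript would fill in later.
sample = """
<html><body>
<table class="static"><tr><td>Less than $250,000</td><td>0%</td></tr></table>
<div id="pricing"><TABLE></TABLE></div>
</body></html>
"""

tables, rows = count_tags(sample)
print("tables: %d, rows: %d" % (tables, rows))
```

If the page source has far fewer tables or rows than the browser displays, the missing ones are most likely being generated by JavaScript.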

To process the JavaScript you need an automated browser engine such as Selenium, or WebKit or Spidermonkey.

Here is a solution using Selenium:

import contextlib

import pandas as pd
import selenium.webdriver as webdriver

@contextlib.contextmanager
def quitting(thing):
    # Shut the browser down even if the with-block raises.
    try:
        yield thing
    finally:
        thing.quit()

with quitting(webdriver.Firefox()) as driver:
    link = 'http://aws.amazon.com/ec2/pricing/'
    driver.get(link)
    # page_source is the HTML as currently rendered by the browser.
    content = driver.page_source
    # Keep a copy of the rendered HTML for inspection.
    with open('/tmp/out.html', 'wb') as f:
        f.write(content.encode('utf-8'))
    data = pd.read_html(content)
    print len(data)

yields

238