Question

I would like to know whether BeautifulSoup can be used to scrape a website whose table is loaded through an AJAX call.

Below is the Python code I use to access the div that contains the table:

table = bs.find('div', id='id+name')

When I run this, I get an empty div: <div id="id+name"></div>

The JavaScript/AJAX function looks like this:

function getTable(){
    $.ajax({
        type: "POST",
        url: "<some processing file .asmx>",
        contentType: "application/json; charset=utf-8",
        dataType:"json",
        success: function(msg){
            $('#table+id').html(msg.d);
        }
    });
}

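I think the div comes back empty because the scrape happens before the page has finished filling in the table. Can this be handled with Beautiful Soup?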

Answer 1

BeautifulSoup is just an HTML parser. You need to execute that JavaScript call and/or make the POST request yourself.

Basically, you have two options:

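Use a tool that drives a real browser, such as selenium. That way the browser does all the work of loading the page and executing the JavaScript for you. You can then use find_element_by_id() to access the element (in Selenium 4 the equivalent call is find_element(By.ID, ...)).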

Use urllib2 or requests to make the POST request and parse the result. Judging by the JavaScript you posted, the response is JSON that contains the HTML of the table:

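import json

import requests
from bs4 import BeautifulSoup

# Placeholder from the question; substitute the real .asmx endpoint.
URL = "<some processing file .asmx>"
response = requests.post(URL)
data = json.loads(response.content)

# The table HTML is stored under the "d" key of the JSON reply.
div = BeautifulSoup(data['d'], 'html.parser')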
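
The second option also mentions urllib2. As a rough sketch of that route, the snippet below uses urllib.request (the Python 3 successor to urllib2) from the standard library. The helper names build_request, unwrap_d and fetch_table_html are hypothetical, and sending an empty JSON object as the body is an assumption; the endpoint in the question may expect something else or nothing at all.

```python
import json
import urllib.request

# Placeholder from the question; substitute the real .asmx endpoint.
URL = "<some processing file .asmx>"


def build_request(url):
    """Build a POST request that mimics the page's $.ajax call."""
    return urllib.request.Request(
        url,
        data=b"{}",  # assumption: an empty JSON object is an acceptable body
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def unwrap_d(body):
    """Return the HTML string stored under the "d" key of the JSON reply."""
    return json.loads(body.decode("utf-8"))["d"]


def fetch_table_html(url):
    """POST to the endpoint and return the table HTML (needs network access)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return unwrap_d(response.read())
```

The string returned by fetch_table_html(URL) can then be handed to BeautifulSoup in the same way as data['d'] above.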

UPD (actual working code that gets the table):

import json

import requests
from bs4 import BeautifulSoup

URL = 'http://www.ise.com/MarketDataService.asmx/ISE_Get_IntraDay_Summary'
# Send the same Content-Type header that the page's $.ajax call uses.
response = requests.post(URL, headers={'Content-Type': 'application/json; charset=utf-8'})
data = json.loads(response.content)

# data['d'] holds the table HTML; print each row's cells separated by " | ".
soup = BeautifulSoup(data['d'], 'html.parser')
for row in soup('tr'):
    print(" | ".join(cell.text for cell in row('td')))

This prints:

All Securities  | All Equities Only | All Indices & ETF Only
16:15 | 244,754 | 258,519 | 503,273 | 95 | 192,025 | 85,778 | 277,803 | 224 | 52,726 | 172,741 | 225,467 | 31
16:10 | 244,473 | 260,881 | 505,354 | 94 | 192,025 | 85,778 | 277,803 | 224 | 52,445 | 175,103 | 227,548 | 30
15:50 | 232,697 | 227,149 | 459,846 | 102 | 182,351 | 81,672 | 264,023 | 223 | 50,343 | 145,477 | 195,820 | 35 
...
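
The printed cells are still strings with thousands separators. If numbers are needed, a minimal sketch along these lines could convert a data row. It assumes the first cell is the time label and every other cell is an integer, so header rows such as the first line of the output would have to be skipped; parse_row is a hypothetical helper, not part of the original answer.

```python
def parse_row(line):
    """Split one printed row into its time label and integer values."""
    cells = [cell.strip() for cell in line.split("|")]
    time_label, values = cells[0], cells[1:]
    # Drop the thousands separators before converting to int.
    return time_label, [int(value.replace(",", "")) for value in values]


# First data row from the output above.
row = ("16:15 | 244,754 | 258,519 | 503,273 | 95 | 192,025 | 85,778 | "
       "277,803 | 224 | 52,726 | 172,741 | 225,467 | 31")
time_label, values = parse_row(row)
print(time_label, values[:4])  # 16:15 [244754, 258519, 503273, 95]
```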
