Scraping content from an AJAX onclick popup

Time: 2014-09-27 20:24:09

Tags: python ajax web-scraping beautifulsoup python-requests

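I am trying to use Python to get information from this page: https://j2c-com.com/Euronaval14/catalogueWeb/catalogue.php?lang=gb. I am particularly interested in the popup that appears when you click an individual exhibitor's name. The challenging part is that the page relies heavily on JavaScript, which makes AJAX calls to load the data.

I inspected the network calls while clicking an exhibitor, and it looks like the AJAX call goes to this URL (for the first exhibitor in the list, "AIAD and MOD ITALY"): https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=D000365D000365&rnd=0.005115277832373977

I understand where the cle parameter comes from (the id attribute of a <span> tag). What I do not understand is where the rnd parameter is derived from. Is it just a random number? I tried supplying a random number with each request, but the returned HTML is missing the actual content of the popup.

This leads me to believe that either the rnd parameter is not a random number, or I need some kind of cookie for the request to return the actual data.

Here is my code so far. I am using Requests to fetch the pages and BeautifulSoup to parse the HTML: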

import random
import decimal
import requests
from bs4 import BeautifulSoup

#base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/catalogue.php?lang=gb'
base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/cataloguerecherche.php?listeFavoris=&typeRecherche=1&typeRechSociete=&typeSociete=&typeMarque=&typeDescriptif=&typeActivite=&choixSociete=&choixPays=&choixActivite=&choixAgent=&choixPavillon=&choixZoneExpo=&langue=gb&rnd=0.1410133063327521'


def generate_random_number(i, d):
    """Return a random decimal string 'I.D' with I in [0, i] and D in [0, d]."""
    return str(decimal.Decimal('%d.%d' % (random.randint(0, i), random.randint(0, d))))



r = requests.get(base_url)
soup = BeautifulSoup(r.text)

table = soup.find('table', {'id':'tableResultat'})

trs = table.findAll('tr')


for tr in trs:
    span = tr.find('span')
    cle = span.get('id')

    url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=' + cle + '&rnd=' + generate_random_number(0,9999999999999999)
    pop = requests.post(url)

    print(url)
    print(pop.text)

    break

Can you help me understand how to capture the popup data successfully, or what I am doing wrong? Thanks in advance!

1 Answer:

Answer 0 (score: 2)

The rnd parameter is irrelevant. It is purely random and is filled in by the Math.random() JS function.
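If rnd really is ignored by the server, as stated above, any float in [0, 1) should do. Here is a minimal sketch that mimics the shape of a Math.random() value using only the standard library. The helper name make_rnd is hypothetical, and the fixed 16-digit formatting is an assumption chosen to avoid scientific notation for very small values:

```python
import random


def make_rnd():
    """Return a random float in [0, 1) as a fixed-point string with 16 decimals."""
    # '.16f' keeps plain decimal notation even for tiny values such as 1e-05,
    # which str() would otherwise render in scientific notation.
    return format(random.random(), '.16f')


rnd = make_rnd()
print(rnd)
```

Unlike building the string from two separate integers, this always yields a zero-padded value of the same length.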

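As you suspected, it comes down to cookies: the PHPSESSID cookie is essential for every subsequent request. Just start a requests.Session() and use it for every request you make:

  "The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance."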
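A minimal offline sketch of that behaviour, assuming the requests library is installed. The cookie value example-session-id is made up for illustration; in the real flow the server sets PHPSESSID in its response to the first GET. The request is only prepared, never sent, so the snippet just shows which Cookie header the session would attach:

```python
import requests

session = requests.Session()
# Hypothetical cookie value; normally the server's first response stores it
# in session.cookies automatically.
session.cookies.set('PHPSESSID', 'example-session-id')

# Build the request without sending it, then let the session merge in its state.
request = requests.Request(
    'POST',
    'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php',
    data={'cle': 'D000365D000365'},
)
prepared = session.prepare_request(request)

cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

A bare requests.post() call starts from an empty cookie jar every time, which is consistent with the popup content going missing in the original script.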

...

# start session
session = requests.Session()

r = session.get(base_url)
soup = BeautifulSoup(r.text)

table = soup.find('table', {'id':'tableResultat'})
trs = table.findAll('tr')

for tr in trs:
    span = tr.find('span')
    cle = span.get('id')

    url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=' + cle + '&rnd=' + generate_random_number(0,9999999999999999)
    pop = session.post(url)  # <-- the POST request here contains cookies returned by the first GET call

    print(url)
    print(pop.text)

    break

It prints (note that the HTML is now populated with the desired data):

https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php?cle=D000365D000365&rnd=0.1625497943120751
<table class='divAdresse'>
    <tr>
        <td class='ficheAdresse' valign='top'>Via Nazionale 54<br>IT-00184 - Roma<br><img
                src='../../intranetJ2C/images/flags/IT.gif' style='margin-right:5px;'>ITALY<br><br>Phone: +39 06 488
            0247 | Fax: +39 06 482 74 76<br><br>Website: <a href='http://www.aiad.it' target='_new'>www.aiad.it</a></td>
    </tr>
</table>
<br>
<b class="divMarque">Contact:</b><br>
<font class="ficheAdresse"> Carlo Festucci - Secretary General<br>
<a href="mailto:c.festucci@aiad.it">c.festucci@aiad.it</a></font>
<br><br>
<div id='divTexte' class='ficheTexte'></div>
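Once a fragment like the one above comes back, individual fields can be pulled out with BeautifulSoup. The following is only a sketch: the function name parse_popup and the choice of fields are assumptions, the sample HTML is abridged from the output above, and the selectors rely on the class names seen in that single response:

```python
from bs4 import BeautifulSoup

# Sample fragment, abridged from the response printed above.
html = (
    "<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>"
    "Via Nazionale 54<br>IT-00184 - Roma<br>ITALY<br><br>"
    "Website: <a href='http://www.aiad.it' target='_new'>www.aiad.it</a>"
    "</td></tr></table><br>"
    '<font class="ficheAdresse"> Carlo Festucci - Secretary General<br>'
    '<a href="mailto:c.festucci@aiad.it">c.festucci@aiad.it</a></font>'
)


def parse_popup(fragment):
    """Extract a few fields from a popup fragment; missing fields become None."""
    soup = BeautifulSoup(fragment, 'html.parser')
    cell = soup.select_one('td.ficheAdresse')
    website = cell.find('a') if cell is not None else None
    mail = soup.select_one('a[href^="mailto:"]')
    return {
        # All text pieces of the address cell, in document order.
        'cell_text': list(cell.stripped_strings) if cell is not None else [],
        'website': website.get('href') if website is not None else None,
        'email': mail.get('href')[len('mailto:'):] if mail is not None else None,
    }


info = parse_popup(html)
print(info)
```

Some exhibitors have no website or e-mail link, so each lookup falls back to None rather than raising.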

UPD。

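The reason you were not getting results for the other exhibitors in the table is hard to explain, but the main idea here is to simulate all of the underlying AJAX requests that are fired when you click a row in the browser: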

import random
import decimal
import requests
from bs4 import BeautifulSoup

base_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/cataloguerecherche.php?listeFavoris=&typeRecherche=1&typeRechSociete=&typeSociete=&typeMarque=&typeDescriptif=&typeActivite=&choixSociete=&choixPays=&choixActivite=&choixAgent=&choixPavillon=&choixZoneExpo=&langue=gb&rnd=0.1410133063327521'
fiche_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/fiche.php'
reload_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/reload.php'
data_url = 'https://j2c-com.com/Euronaval14/catalogueWeb/ajaxSociete.php'


def generate_random_number(i, d):
    """Return a random decimal string 'I.D' with I in [0, i] and D in [0, d]."""
    return str(decimal.Decimal('%d.%d' % (random.randint(0, i), random.randint(0, d))))


# start session
session = requests.Session()

r = session.get(base_url)
soup = BeautifulSoup(r.content)
for span in soup.select('table#tableResultat tr span'):
    cle = span.get('id')

    session.post(reload_url)
    session.post(fiche_url, data={'page': 'page:catalogue',
                                  'pasFavori': '1',
                                  'listeFavoris': '',
                                  'cle': cle,
                                  'stand': '',
                                  'rnd': generate_random_number(0, 9999999999999999)})
    session.post(reload_url)
    pop = session.post(data_url, data={'cle': cle,
                                       'rnd': generate_random_number(0, 9999999999999999)})

    print(pop.text)

Prints:

<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>Via Nazionale 54<br>IT-00184 - Roma<br><img src='../../intranetJ2C/images/flags/IT.gif' style='margin-right:5px;'>ITALY<br><br>Phone: +39 06 488 0247 | Fax: +39 06 482 74 76<br><br>Website: <a href='http://www.aiad.it' target='_new'>www.aiad.it</a></td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Carlo Festucci - Secretary General<br><a href="mailto:c.festucci@aiad.it">c.festucci@aiad.it</a></font><br><br><div id='divTexte' class='ficheTexte'></div>
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>An der Faehre 2<br>27809 - Lemwerder<br><img src='../../intranetJ2C/images/flags/DE.gif' style='margin-right:5px;'>GERMANY<br><br>Phone: +49 421 673 30 | Fax: +49 421 673 3115<br><br>Website: <a href='http://www.abeking.com' target='_new'>www.abeking.com</a></td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Thomas Haake - Sales Director Navy</font><br><br><div id='divTexte' class='ficheTexte'></div>
<table class='divAdresse'><tr><td class='ficheAdresse' valign='top'>Mohamed Bin Khalifa Street (street 15)<br>PO Box 107241<br>107241 - Abu Dhabi<br><img src='../../intranetJ2C/images/flags/AE.gif' style='margin-right:5px;'>UNITED ARAB EMIRATES<br><br>Phone: +971 2 445 5551 | Fax: +971 2 445 0644</td></tr></table><br><b class="divMarque">Contact:</b><br><font class="ficheAdresse"> Pierre Baz - Business Development<br><a href="mailto:pierre.baz@abudhabimar.com">pierre.baz@abudhabimar.com</a></font><br><br><div id='divTexte' class='ficheTexte'></div>
...
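The request sequence used in the loop above could also be wrapped in a function, which makes the order of calls easy to check without touching the network. This is a sketch under assumptions: fetch_popup and RecordingSession are hypothetical names, and the stub only records calls in place of a real requests.Session:

```python
import random

BASE = 'https://j2c-com.com/Euronaval14/catalogueWeb/'


def _rnd():
    """Random value in [0, 1) formatted like a Math.random() result."""
    return format(random.random(), '.16f')


def fetch_popup(session, cle):
    """Replay reload -> fiche -> reload -> ajaxSociete for one exhibitor.

    `session` is any object with a requests.Session-style post(url, data=...).
    """
    session.post(BASE + 'reload.php')
    session.post(BASE + 'fiche.php', data={
        'page': 'page:catalogue', 'pasFavori': '1', 'listeFavoris': '',
        'cle': cle, 'stand': '', 'rnd': _rnd()})
    session.post(BASE + 'reload.php')
    response = session.post(BASE + 'ajaxSociete.php',
                            data={'cle': cle, 'rnd': _rnd()})
    return response.text


class StubResponse:
    text = '<stub>'


class RecordingSession:
    """Hypothetical stand-in that records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def post(self, url, data=None):
        self.calls.append((url, data))
        return StubResponse()


stub = RecordingSession()
fetch_popup(stub, 'D000365D000365')
for url, _ in stub.calls:
    print(url.rsplit('/', 1)[-1])
```

With a live session, the same function would be called as fetch_popup(session, cle) after session.get(base_url), so that the PHPSESSID cookie is already in place.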