Scraping a "hidden" table from a web page

Date: 2021-03-19 14:24:37

Tags: python selenium web-scraping beautifulsoup python-requests

I am trying to fetch the table at the following URL: https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2. I tried reading it with requests and BeautifulSoup:

from bs4 import BeautifulSoup as bs
import requests
s = requests.session()
req = s.get('https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2', headers={
"User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
               "Chrome/51.0.2704.103 Safari/537.36"})
soup = bs(req.content, "html.parser")  # specify a parser explicitly
table = soup.find('table')

However, I only get the table's header:

<table class="table">
<caption class="pl8">Ricoverati e posti letto in area non critica e terapia intensiva.</caption>
<thead>
<tr>
<th class="cella-tabella-sm align-middle text-center" scope="col">Regioni</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Area Non Critica</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Area Non Critica</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">Ricoverati in Terapia intensiva</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL in Terapia Intensiva</th>
<th class="cella-tabella-sm bg-blu align-middle text-center" scope="col">PL Terapia Intensiva attivabili</th>
</tr>
</thead>
<tbody id="tab2_body">
</tbody>
</table>

So I tried the URL that I believe the table data actually comes from: https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2. But in this case I always get a 401 status code, even when adding the username and password as headers on top of those used in the previous request. For example:

requests.get('https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2',
             headers={'username': 'Agenas', 'password': 'tab2-19',
                      'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'})

Any idea how to solve this? Thanks.

1 Answer:

Answer 0 (score: 2)

The "secret" values that the request headers need are actually embedded in a <script> tag on the regular page. So you can dig them out, parse them as JSON, and use them in the request headers.

Here's how:

import json
import re

import requests
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/89.0.4389.90 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

with requests.Session() as s:
    end_point = "https://Agenas:tab2-19@www.agenas.gov.it/covid19/web/index.php?r=json%2Ftab2"
    regular_page = "https://www.agenas.gov.it/covid19/web/index.php?r=site%2Ftab2"

    # The last <script> tag on the regular page contains the AJAX call,
    # including the headers the JSON endpoint expects
    html = s.get(regular_page, headers=headers).text
    script_text = BeautifulSoup(html, "html.parser").find_all("script")[-1].string

    # Pull the object literal out of "headers: {...}," and parse it as JSON
    hacked_payload = json.loads(
        re.search(r"headers:\s({.*}),", script_text, re.S).group(1).strip()
    )

    headers.update(hacked_payload)
    print(json.dumps(s.get(end_point, headers=headers).json(), indent=2))

Output:

[
  {
    "regione": "Abruzzo",
    "dato1": "667",
    "dato2": "1495",
    "dato3": "89",
    "dato4": "215",
    "dato5": "0"
  },
  {
    "regione": "Basilicata",
    "dato1": "164",
    "dato2": "426",
    "dato3": "12",
    "dato4": "88",
    "dato5": "13"
  },

and so on ...
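In case the regex step is unclear: re.search(r"headers:\s({.*}),", ...) captures the JavaScript object literal that follows "headers:" in the page's AJAX call and hands it to json.loads. Here is a self-contained sketch run against a made-up script snippet; the header names and values are invented for illustration, not the site's real ones:

```python
import json
import re

# A made-up <script> body in the shape the page uses;
# the header names/values below are hypothetical
script_text = """
$.ajax({
    url: "index.php?r=json%2Ftab2",
    headers: {"X-Some-Token": "abc123", "X-Other": "xyz"},
    success: render
});
"""

# Same regex as in the answer: grab the object literal after "headers:"
payload = json.loads(
    re.search(r"headers:\s({.*}),", script_text, re.S).group(1).strip()
)
print(payload)  # {'X-Some-Token': 'abc123', 'X-Other': 'xyz'}
```

Note this only works because the object literal in the page happens to be valid JSON (keys and values are double-quoted); for unquoted JavaScript keys you would need a more lenient parser.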