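Web scraping relevant information from a soup object

Date: 2020-11-09 12:57:39

Tags: web-scraping beautifulsoup

I am trying to scrape this particular URL to get the names and addresses of the bank's branches and ATMs.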

url="https://www.bankmayapada.com/en/contactus/location-information"

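However, the soup I get back is very messy, and I can't figure out how to extract the information I need.

What I need is each branch/ATM name and its corresponding address. For now, I am just trying to work out the structure of the soup.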

import re
import requests
from bs4 import BeautifulSoup

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

1 answer:

Answer 0 (score: 2)

You can get the table data with a single POST request. Fun fact: no payload is needed!

Here's how:

import requests
from bs4 import BeautifulSoup

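# POST with no payload; the response body is the HTML that contains the locations table.
page = requests.post("https://myapps.bankmayapada.com/frontend/IN/lokasi.aspx").text
# Each data row in that table carries the class "dxgvDataRow".
rows = BeautifulSoup(page, "html.parser").find_all("tr", {"class": "dxgvDataRow"})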

branch_location_data = []
for row in rows:
    province, area, location = row.find_all("td")
    branch_location_data.append(
        [
            province.getText(strip=True),  # province column
            area.getText(strip=True),  # area column
            location.find("b").getText(strip=True),  # Branch name
            " ".join(
                d.getText() for d in location.find_all("div")  # branch address
                if not d.getText().startswith(("Tel", "Fax"))  # skipping Phone & Fax info
            ),
        ]
    )
for branch in branch_location_data:
    print(branch)

Output:

['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Citra Garden 2', 'Rukan Citra Niaga Blok A-7 Jl. Utan Jati - Kalideres Jakarta - DKI  Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Puri Indah', 'Jl. Puri Indah Raya Blok I No. 2 Jakarta 11610 - DKI  Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Pasar Pagi Asemka', 'Jl. Pasar Pagi No. 84 Jakarta - DKI  Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Tanjung Duren', 'Jl. Tanjung Duren No. 91 B Jakarta 11470 - DKI  Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Meruya', 'Jl. Meruya Ilir Raya No. 82 G Jakarta - DKI  Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Jembatan Lima', 'Jl. KH Moch. Mansyur No. 24 A Jakarta - DKI  Jakarta']
and so on...
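
If you want to check the row-parsing logic without hitting the site, here is a minimal offline sketch. The HTML fragment is hand-written and only assumed to resemble one row of the real table (three cells, with a `<b>` branch name and `<div>` lines in the last cell). The `Tel`/`Fax` values are placeholders, and `SAMPLE_HTML` and `parse_rows` are names made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a stand-in for one row of the real table, not copied from the site.
SAMPLE_HTML = """
<table>
  <tr class="dxgvDataRow">
    <td>DKI JAKARTA</td>
    <td>Jakarta Barat</td>
    <td>
      <b>Kantor Capem Puri Indah</b>
      <div>Jl. Puri Indah Raya Blok I No. 2</div>
      <div>Jakarta 11610</div>
      <div>Tel. 000-0000 (placeholder)</div>
      <div>Fax. 000-0000 (placeholder)</div>
    </td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return [province, area, branch name, address] for every data row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr", {"class": "dxgvDataRow"}):
        # Same unpacking as in the answer: exactly three cells per row are expected.
        province, area, location = row.find_all("td")
        # Join the address lines, skipping the ones that hold phone or fax numbers.
        address = " ".join(
            d.get_text(strip=True)
            for d in location.find_all("div")
            if not d.get_text(strip=True).startswith(("Tel", "Fax"))
        )
        results.append(
            [
                province.get_text(strip=True),
                area.get_text(strip=True),
                location.find("b").get_text(strip=True),
                address,
            ]
        )
    return results


for branch in parse_rows(SAMPLE_HTML):
    print(branch)
```

Note that `str.startswith` accepts a tuple of prefixes, which is what lets a single condition skip both the phone and the fax lines. If the real markup differs from this assumed fragment (for example, a different number of cells per row), the three-way unpacking will raise a `ValueError`, which is a useful early signal that the page layout has changed.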