I am trying to scrape this URL to get the names and addresses of the bank's branches/ATMs.
url="https://www.bankmayapada.com/en/contactus/location-information"
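However, the soup I get back is very messy, and I can't figure out how to extract the information I need.
What I need is each branch/ATM name and its corresponding address. So far, I am only trying to work out the structure of the soup.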
import re
import requests
from bs4 import BeautifulSoup
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
Answer 0 (score: 2)
You can get the table data with a single POST request. Fun fact: no payload is needed!
Here's how:
import requests
from bs4 import BeautifulSoup

page = requests.post("https://myapps.bankmayapada.com/frontend/IN/lokasi.aspx").text
rows = BeautifulSoup(page, "html.parser").find_all("tr", {"class": "dxgvDataRow"})

branch_location_data = []
for row in rows:
    province, area, location = row.find_all("td")
    branch_location_data.append(
        [
            province.getText(strip=True),  # province column
            area.getText(strip=True),  # area column
            location.find("b").getText(strip=True),  # branch name
            " ".join(
                d.getText() for d in location.find_all("div")  # branch address
                if not d.getText().startswith(("Tel", "Fax"))  # skip phone & fax info
            ),
        ]
    )

for branch in branch_location_data:
    print(branch)
Output:
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Citra Garden 2', 'Rukan Citra Niaga Blok A-7 Jl. Utan Jati - Kalideres Jakarta - DKI Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Puri Indah', 'Jl. Puri Indah Raya Blok I No. 2 Jakarta 11610 - DKI Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Pasar Pagi Asemka', 'Jl. Pasar Pagi No. 84 Jakarta - DKI Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Tanjung Duren', 'Jl. Tanjung Duren No. 91 B Jakarta 11470 - DKI Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Meruya', 'Jl. Meruya Ilir Raya No. 82 G Jakarta - DKI Jakarta']
['DKI JAKARTA', 'Jakarta Barat', 'Kantor Capem Jembatan Lima', 'Jl. KH Moch. Mansyur No. 24 A Jakarta - DKI Jakarta']
and so on...
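The address column above is built by joining the text of every `div` in the location cell except those that start with "Tel" or "Fax". This works because `str.startswith` accepts a tuple of prefixes. Below is a minimal sketch of just that step on plain strings, so it can be tried without a network request. The `build_address` helper name and the sample lines are hypothetical: the address text is taken from the output above, and the phone and fax entries are placeholders. Unlike the answer's code, this sketch also strips surrounding whitespace from each line.

```python
def build_address(lines):
    """Join address lines into one string, skipping phone and fax entries."""
    skip_prefixes = ("Tel", "Fax")  # str.startswith accepts a tuple of prefixes
    return " ".join(
        line.strip()
        for line in lines
        if not line.strip().startswith(skip_prefixes)
    )


# Hypothetical cell contents; the phone and fax numbers are placeholders.
sample_lines = [
    "Jl. Puri Indah Raya Blok I No. 2",
    "Jakarta 11610 - DKI Jakarta",
    "Tel. 000-0000",
    "Fax. 000-0000",
]

print(build_address(sample_lines))
```

If the real page labels phone lines differently (for example "Phone" or "Telp"), the tuple of prefixes would need to be extended accordingly.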