How to iterate through pages in web scraping when the URL does not change

Time: 2020-11-10 12:24:26

Tags: python selenium-webdriver web-scraping beautifulsoup

I want to get a list of branches and ATMs (only), along with their addresses.

I am trying to scrape:

url="https://www.ocbcnisp.com/en/hubungi-kami/lokasi-kami"

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get(url)  # the variable defined above is lowercase `url`
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

import pandas as pd

Branch_list = []
Address_list = []

# Read the title and description from each location card, so every
# branch name stays paired with its own address.
for card in soup.find_all('div', class_="ocbc-card ocbc-card--location"):
    branch = card.find('p', class_="ocbc-card__title")
    address = card.find('p', class_="ocbc-card__desc")
    if branch is None or address is None:
        continue  # skip cards that lack a title or an address
    # get_text() drops the markup, replacing the regex-based tag stripping
    Branch_list.append(branch.get_text(strip=True))
    Address_list.append(address.get_text(strip=True))

OCBC = pd.DataFrame({'Branch_Name': Branch_list, 'Address': Address_list})

This gives me the required information for the first page, but I want to do the same for all pages. Can anyone suggest how?

1 answer:

Answer 0 (score: 1)

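Try the approach below using Python's requests library. It is simple, straightforward, reliable, and fast, and it needs less code than driving a browser. I found the API URL on the site itself by inspecting the Network tab in Google Chrome's developer tools.

What exactly the script below does: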

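  1. First, it takes the API URL, together with the headers, the payload, and a dynamic parameter written in capitals (PAGE_NO), and sends a POST request.

  2. The payload is dynamic: you can pass any valid value in its parameters, and the data is built for you each time you fetch content from the site. (Important: do not change the value of the PAGE_NO parameter.)

  3. After fetching the data, the script parses the JSON response with json.loads.

  4. Finally, it iterates over every entry in the list of locations fetched for each iteration (page) and reads fields such as address, name, telephone number, fax, and city. You can modify these attributes as needed.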
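The four steps above can be sketched without touching the network, which makes the paging logic easier to check. This is only a sketch under stated assumptions: `build_payload`, `iter_locations`, and the `fetch_page` callable are hypothetical names that are not part of the answer, and the response is assumed to be a JSON object with a `listItem` list, as the script implies.

```python
import json
from urllib.parse import urlencode

# Fixed IDs copied from the payload string used in the answer's script.
DS_LOCATION_RESULT = "{76EE6530-2A27-46A7-8B32-52E3DAE19DC3}"
ITEM_ID = "{C59FD793-38C1-444C-9612-1E3A3019BED3}"


def build_payload(page_no, query=""):
    """Return the form-encoded POST body for one page of results."""
    return urlencode({
        "currPage": page_no,
        "query": query,
        "dsLocationResult": DS_LOCATION_RESULT,
        "itemId": ITEM_ID,
    })


def iter_locations(fetch_page, start_page=1, max_pages=1000):
    """Yield location dicts page by page until a page comes back empty.

    fetch_page is a hypothetical callable: it receives the encoded
    payload and returns the raw JSON text of the response.
    max_pages is a safety cap so the loop always terminates.
    """
    for page_no in range(start_page, start_page + max_pages):
        result = json.loads(fetch_page(build_payload(page_no)))
        # Assumption: a page with data is a dict holding a 'listItem' list.
        items = result.get("listItem") if isinstance(result, dict) else None
        if not items:
            break
        yield from items
```

With requests, `fetch_page` could be something like `lambda body: requests.post(url, data=body, headers=headers).text`, reusing the `url` and `headers` from the script. Passing the fetch step in as a parameter keeps the paging loop testable with canned responses.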

    import json
    import requests

    def scrap_atm_data():
        PAGE_NO = 1
        url = "https://www.ocbcnisp.com/api/sitecore/ContactUs/GetMapsList"  # API URL

        # Both the content-type and the cookie header must be sent
        headers = {
            'content-type': 'application/x-www-form-urlencoded',
            'cookie': 'ocbc#lang=en; ASP.NET_SessionId=xb3nal2u21pyh0rnlujvfo2p; sxa_site=OCBC; ROUTEID=.2; nlbi_1130533=goYxXNJYEBKzKde7Zh+2XAAAAADozEuZQihZvBGZfxa+GjRf; visid_incap_1130533=1d1GBKkkQPKgTx+24RCCe6CPql8AAAAAQUIPAAAAAAChaTReUWlHSyevgodnjCRO; incap_ses_1185_1130533=hofQMZCe9WmvOiUTXvdxEKGPql8AAAAAvac5PaS0noMc+UXHbHc1DA==; SC_ANALYTICS_GLOBAL_COOKIE=e0aa2fcca70c4d999a32fc1f74d09fc8|True; incap_ses_707_1130533=OcSGOGJw3joFLj7x/8TPCVuWql8AAAAAlY3z7ZcDzd/Kba5s5UgLPQ==',
        }

        while True:
            print('Creating new payload data for page no : ' + str(PAGE_NO))
            payload = 'currPage=' + str(PAGE_NO) + '&query=&dsLocationResult=%7B76EE6530-2A27-46A7-8B32-52E3DAE19DC3%7D&itemId=%7BC59FD793-38C1-444C-9612-1E3A3019BED3%7D'

            # verify=False skips TLS certificate verification, as in the original script
            response = requests.post(url, data=payload, headers=headers, verify=False)
            result = json.loads(response.text)  # parse the JSON response
            print('Created new payload now going to fetch data...')

            # Stop when the response, or its 'listItem' list, is empty
            extracted_data = result['listItem'] if result else []
            if not extracted_data:
                break

            for data in extracted_data:
                print('-' * 100)
                print('Fetching data for -> ', data['name'])
                print('Name : ', data['name'])
                print('Address : ', data['alamat'])
                print('City : ', data['city'])
                print('fax : ', data['fax'])
                print('Operating Hours : ', data['operation_hour'])
                print('Telephone Number : ', data['telp'])
                print('Location Type : ', data['type_location'])
                print('-' * 100)
            PAGE_NO += 1  # move on to the next page

    scrap_atm_data()
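Since the question asks for branches and ATMs only, collected in a table, the printed fields could instead be gathered into rows and filtered on `type_location`. The sketch below is an assumption-laden illustration: `to_rows` is a hypothetical helper, and the actual values the API returns in `type_location` are not shown in this thread, so the wanted values must be checked against real responses.

```python
def to_rows(items, wanted_types=None):
    """Turn location dicts into flat rows, optionally filtered by type.

    items        -- iterable of dicts with the keys used in the script
                    above ('name', 'alamat', 'type_location')
    wanted_types -- set of 'type_location' values to keep, or None to
                    keep everything (the real values are an assumption
                    and must be checked against the API response)
    """
    rows = []
    for item in items:
        loc_type = item.get("type_location") or ""
        if wanted_types is not None and loc_type not in wanted_types:
            continue  # not a location type the caller asked for
        rows.append({
            # Column names follow the DataFrame built in the question.
            "Branch_Name": (item.get("name") or "").strip(),
            "Address": (item.get("alamat") or "").strip(),
            "Type": loc_type,
        })
    return rows
```

The returned list of dicts can be passed straight to `pd.DataFrame(rows)` to get the `Branch_Name` and `Address` columns from the question, with the location type as an extra column.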
    

Output