How do I navigate pagination with Selenium in Python?

Time: 2019-06-21 18:19:59

Tags: python selenium-webdriver

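I am working on website automation and want to navigate through the different pages. I believe the site is built with Angular. The pagination links call a JS function from their click handler.

The HTML code is: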

<li ng-if="directionLinks" ng-class="{ disabled : pagination.current == pagination.last }" class="ng-scope"><a href="" ng-click="setCurrent(pagination.current + 1)" class="xh-highlight">›</a></li>

Edited:

Website link: https://jobee.pk/jobs-in-pakistan

Code tried so far:

from selenium import webdriver
import time
class JobeePK:
    def __init__(self):
        # self.url = ""
        pass
    def driver(self):
        driver = webdriver.Chrome()
        driver.maximize_window()
        time.sleep(1)
        return driver

    # https://www.rozee.pk/job/jsearch/q/all/fc/1185/fpn/
    def extractData(self,search_link, total_pages):
        driver = self.driver()
        driver.get(search_link)
        time.sleep(5)

        for page_number in range(0, total_pages):
            driver.find_element_by_css_selector()
            time.sleep(10)



if __name__ == '__main__':
    jb = JobeePK()
    url = "https://jobee.pk/jobs-in-pakistan"
    total_pages = 128
    jb.extractData(url, total_pages)

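Please suggest any solution to this problem. Thanks.

1 Answer:

Answer 0: (score: 1)

In a situation like this, it is always worth looking closely at the page to understand how the data is actually updated.

I did this by opening the console in Firefox and looking at the XHR traffic in the Network tab.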


...Interesting. The page fetches its results from an endpoint that we can identify.

It returns JSON data, which is great:

{'totalJobs': 2541,
 'jobs': [{'location': [{'jobLocationID': 0,
     'jobID': 24986,
     'countryID': 0,
     'country': 'Pakistan',
     'cityID': None,
     'cityText': 'Karachi',
     'jobShiftID': 0,
     'name': None}],
   'jobID': 24986,
   'jobIDEncrypted': '26cfb27ee6b2abad',
   'title': 'Marketing Officer - Freelancer',
   'jobDescription': '<p>We are growing, energetic, and highly-reputed Public Relation (PR) and Digital Marketing Agency.<br />\nCurrently, we are looking for ...
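To make the shape of that response concrete, here is a minimal sketch that pulls a few fields out of one entry. The `sample` dict is a trimmed copy of the data shown above, and `summarize_job` is a hypothetical helper name used only for illustration.

```python
# Trimmed copy of the response shown above (only the keys used here).
sample = {
    "totalJobs": 2541,
    "jobs": [
        {
            "location": [{"country": "Pakistan", "cityText": "Karachi"}],
            "jobID": 24986,
            "title": "Marketing Officer - Freelancer",
        }
    ],
}


def summarize_job(job):
    """Return (jobID, title, cities) for one entry of the 'jobs' list."""
    # Each job carries a list of locations; collect the city names.
    cities = [loc.get("cityText") for loc in job.get("location", [])]
    return job.get("jobID"), job.get("title"), cities


for job in sample["jobs"]:
    print(summarize_job(job))
```

Using `.get()` keeps the helper from raising if an entry happens to lack one of these keys, which seems a reasonable precaution for data coming from an undocumented endpoint.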

Let's use it to write our script:

import json
import math

import pandas as pd
import requests

# The scraping function
def getJobs(pageNumber):

    #Defining the headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json;charset=utf-8',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://jobee.pk/jobs-in-pakistan',
        'Pragma': 'no-cache'      
    }

    # Request body for the search endpoint; pageSize is set to 200 (results per page)
    data = {"model":{"titles":[],"cities":[],"shifts":[],"experinces":[],"careerLevels":[],"functionalAreas":[],"genders":[],"industries":[],"degreeLevels":[],"companies":[]},"pageNumber":1,"pageSize":200}

    #Updating the page number
    data['pageNumber'] = pageNumber
    data = json.dumps(data)

    #Collecting the results
    response = requests.post('https://jobee.pk/job/jobsearch', headers=headers, data=data)

    # Fall back to an empty result if the response body is not valid JSON
    try:
        return json.loads(response.content)
    except ValueError:
        return {'jobs': []}

# Fetch page 1 first to learn the total number of jobs, then derive the page count
data = getJobs(1)
totalJobs = data['totalJobs']
number_of_pages = math.ceil(totalJobs / 200)

#Initializing our job list
jobs_list = []

# Looping through the pages
for pageNumber in range(1, number_of_pages + 1):
    results = getJobs(pageNumber)

    # If a page comes back with no jobs, end the loop
    if not results.get('jobs'):
        break
    else:
        # Append the entries under the 'jobs' key to our list
        jobs_list += results['jobs']
        print('Page', pageNumber, '-', len(jobs_list), "jobs collected")

# Let's look at the data in a DataFrame
df = pd.DataFrame(jobs_list)
print(df)

Output:

Page 1 - 200 jobs collected
Page 2 - 400 jobs collected
Page 3 - 600 jobs collected
...

+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
|    |    appliedByDate     |    companyName     | experience  |     expiredDate      | isSalaryVisible  |                  jobDescription                    | jobID  |  jobIDEncrypted   |                     location                       |     logo       | numberOfPositions  |        postDate          |       publishDate        |  salaryRange   |                      skills                        |                   title                    |     titleWithoutSpecialCharacters      | viewCount |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| 0  | 0001-01-01T00:00:00  | Custom House       | Fresh       | 2019-09-19T00:00:00  | True             | <p>We require Mean Stack Developer Interns who...  | 27925  | a0962bea0bc174a1  | [{'jobLocationID': 0, 'jobID': 27925, 'country...  | 14564Logo.jpg  |                 3  | 2019-06-21T14:04:01.363  | 2019-06-21T19:26:24.213  | 5000 - 10000   | [AngularJs, Mongo DB, JavaScript, Node Js, Mea...  | Mean Stack Developer - Intern              | Mean-Stack-Developer-Intern            |        10 |
| 1  | 0001-01-01T00:00:00  | Custom House       | Fresh       | 2019-09-19T00:00:00  | True             | <p>We requires SEO, Digital Marketing and Grap...  | 27924  | 81e4e7f7d672dffd  | [{'jobLocationID': 0, 'jobID': 27924, 'country...  | 14564Logo.jpg  |                 2  | 2019-06-21T14:00:26.45   | 2019-06-21T19:25:04.493  | 5000 - 10000   | [Graphic Design, Search Engine Optimization (S...  | SEO Executive / Graphic Designer - Intern  | SEO-Executive-Graphic-Designer-Intern  |        10 |
| 2  | 0001-01-01T00:00:00  | Printoscan Lahore  | 1 Year      | 2019-09-19T00:00:00  | True             | <p>We require an <strong>Accounts Assistant / ...  | 27923  | 137a257e9e5bbb5d  | [{'jobLocationID': 0, 'jobID': 27923, 'country...  | None           |                 1  | 2019-06-21T13:59:37.373  | 2019-06-21T19:19:07.36   | 15000 - 20000  | [Accounts Services, Administrative Skills, Acc...  | Accounts Assistant / Administrator         | Accounts-Assistant-Administrator       |         6 |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+

That is exactly what we wanted.
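The page count in the script comes from dividing the total number of jobs by the page size and rounding up. A small sketch of that arithmetic, together with the request body for a given page, might look like this. `page_count` and `build_payload` are illustrative names, and the payload keys are copied from the script above rather than from any documented API.

```python
import json
import math


def page_count(total_jobs, page_size=200):
    """Number of requests needed to cover total_jobs at page_size per page."""
    return math.ceil(total_jobs / page_size)


def build_payload(page_number, page_size=200):
    """Serialize the JSON body for one page of search results."""
    # Filter names as they appear in the script above ("experinces" is
    # spelled the way the script sends it).
    filters = ["titles", "cities", "shifts", "experinces", "careerLevels",
               "functionalAreas", "genders", "industries", "degreeLevels",
               "companies"]
    body = {
        "model": {name: [] for name in filters},
        "pageNumber": page_number,
        "pageSize": page_size,
    }
    return json.dumps(body)


print(page_count(2541))  # 2541 jobs at 200 per page -> 13
```

Building a fresh payload per page avoids mutating a shared dict between requests, and keeping the page size as a parameter means the page count and the request body cannot drift apart.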