I am working on website automation and want to navigate through the site's different pages. I believe the site is built with Angular. The pagination section has a JS function, which can also be invoked through an onClick handler.
The HTML code is:
<li ng-if="directionLinks" ng-class="{ disabled : pagination.current == pagination.last }" class="ng-scope"><a href="" ng-click="setCurrent(pagination.current + 1)" class="xh-highlight">›</a></li>
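For reference, a minimal sketch (assuming Selenium with Chrome; the selector is inferred from the ng-click attribute above, not verified on the live page) of how that "next" link could be clicked with an explicit wait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://jobee.pk/jobs-in-pakistan")
# Assumed selector: the "next" anchor carries ng-click="setCurrent(pagination.current + 1)"
next_link = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable(
        (By.CSS_SELECTOR, "a[ng-click='setCurrent(pagination.current + 1)']")))
next_link.click()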
Edited:
Website link: https://jobee.pk/jobs-in-pakistan
Code tried so far:
from selenium import webdriver
import time
class JobeePK:

    def __init__(self):
        # self.url = ""
        pass

    def driver(self):
        driver = webdriver.Chrome()
        driver.maximize_window()
        time.sleep(1)
        return driver

    # https://www.rozee.pk/job/jsearch/q/all/fc/1185/fpn/
    def extractData(self, search_link, total_pages):
        driver = self.driver()
        driver.get(search_link)
        time.sleep(5)
        for page_number in range(0, total_pages):
            # Click the paginator's "next" link; the selector matches the
            # ng-click attribute from the HTML shown above
            driver.find_element_by_css_selector(
                "a[ng-click='setCurrent(pagination.current + 1)']").click()
            time.sleep(10)

if __name__ == '__main__':
    jb = JobeePK()
    url = "https://jobee.pk/jobs-in-pakistan"
    total_pages = 128
    jb.extractData(url, total_pages)
Please suggest a solution to this problem. Thanks.
Answer 0 (score: 1)
In cases like this it is always worth taking a close look at the page to understand how the data is actually refreshed. I did this by opening the developer console in Firefox and watching the XHR traffic in the Network tab.
...interesting. The page fetches its results from an endpoint we can pin down. It returns JSON data, which is great:
{'totalJobs': 2541,
'jobs': [{'location': [{'jobLocationID': 0,
'jobID': 24986,
'countryID': 0,
'country': 'Pakistan',
'cityID': None,
'cityText': 'Karachi',
'jobShiftID': 0,
'name': None}],
'jobID': 24986,
'jobIDEncrypted': '26cfb27ee6b2abad',
'title': 'Marketing Officer - Freelancer',
'jobDescription': '<p>We are growing, energetic, and highly-reputed Public Relation (PR) and Digital Marketing Agency.<br />\nCurrently, we are looking for ...
Let's use that to write our script:
import requests
import math
import json
import pandas as pd

# The scraping function
def getJobs(pageNumber):
    # Defining the headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json;charset=utf-8',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://jobee.pk/jobs-in-pakistan',
        'Pragma': 'no-cache'
    }
    # Setting the right params for the request; pageSize is set to 200 (results per page).
    # The payload keys, including the 'experinces' spelling, mirror the site's own request.
    data = {"model": {"titles": [], "cities": [], "shifts": [], "experinces": [], "careerLevels": [], "functionalAreas": [], "genders": [], "industries": [], "degreeLevels": [], "companies": []}, "pageNumber": 1, "pageSize": 200}
    # Updating the page number
    data['pageNumber'] = pageNumber
    data = json.dumps(data)
    # Collecting the results
    response = requests.post('https://jobee.pk/job/jobsearch', headers=headers, data=data)
    # Just in case an error shows up
    try:
        return json.loads(response.content)
    except:
        return {'jobs': []}

# First, get the total number of pages from page 1
data = getJobs(1)
totalJobs = data['totalJobs']
number_of_pages = math.ceil(totalJobs / 200)

# Initializing our job list
jobs_list = []

# Looping through the pages
for pageNumber in range(1, number_of_pages + 1):
    results = getJobs(pageNumber)
    # If there are no results, we end the loop
    if len(results['jobs']) == 0:
        break
    else:
        # Append the results under the 'jobs' key to our list
        jobs_list += results['jobs']
        print('Page', pageNumber, '-', len(jobs_list), "jobs collected")

# Let's have a look at the data in a dataframe
df = pd.DataFrame(jobs_list)
print(df)
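If you want to keep the results, a short follow-up (an assumption on my part, not part of the original script; the file name is hypothetical) deduplicates and exports the dataframe:
# Jobs can repeat across pages, so drop duplicates by their jobID
df = df.drop_duplicates(subset='jobID')
# Export to CSV; 'jobee_jobs.csv' is a hypothetical output path
df.to_csv('jobee_jobs.csv', index=False)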
Output
Page 1 - 200 jobs collected
Page 2 - 400 jobs collected
Page 3 - 600 jobs collected
...
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| | appliedByDate | companyName | experience | expiredDate | isSalaryVisible | jobDescription | jobID | jobIDEncrypted | location | logo | numberOfPositions | postDate | publishDate | salaryRange | skills | title | titleWithoutSpecialCharacters | viewCount |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| 0 | 0001-01-01T00:00:00 | Custom House | Fresh | 2019-09-19T00:00:00 | True | <p>We require Mean Stack Developer Interns who... | 27925 | a0962bea0bc174a1 | [{'jobLocationID': 0, 'jobID': 27925, 'country... | 14564Logo.jpg | 3 | 2019-06-21T14:04:01.363 | 2019-06-21T19:26:24.213 | 5000 - 10000 | [AngularJs, Mongo DB, JavaScript, Node Js, Mea... | Mean Stack Developer - Intern | Mean-Stack-Developer-Intern | 10 |
| 1 | 0001-01-01T00:00:00 | Custom House | Fresh | 2019-09-19T00:00:00 | True | <p>We requires SEO, Digital Marketing and Grap... | 27924 | 81e4e7f7d672dffd | [{'jobLocationID': 0, 'jobID': 27924, 'country... | 14564Logo.jpg | 2 | 2019-06-21T14:00:26.45 | 2019-06-21T19:25:04.493 | 5000 - 10000 | [Graphic Design, Search Engine Optimization (S... | SEO Executive / Graphic Designer - Intern | SEO-Executive-Graphic-Designer-Intern | 10 |
| 2 | 0001-01-01T00:00:00 | Printoscan Lahore | 1 Year | 2019-09-19T00:00:00 | True | <p>We require an <strong>Accounts Assistant / ... | 27923 | 137a257e9e5bbb5d | [{'jobLocationID': 0, 'jobID': 27923, 'country... | None | 1 | 2019-06-21T13:59:37.373 | 2019-06-21T19:19:07.36 | 15000 - 20000 | [Accounts Services, Administrative Skills, Acc... | Accounts Assistant / Administrator | Accounts-Assistant-Administrator | 6 |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
And that is exactly what we wanted.