I have a Glassdoor link that I am trying to fetch with requests.get():
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22teaching%22&sc.locationSeoString=new+york&locId=1132348&locT=C
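I noticed that clicking through to the next page adds lo_IP{page_number}.htm to the URL. For example, page 4 is:
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22teaching%22&sc.locationSeoString=new+york&locId=1132348&lo_IP4.htm
But when I go to that link directly (page 4, for example), it does not take me to page 4. Is there a way to go to page n?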
import requests
from bs4 import BeautifulSoup

pages = 2
# The headers do not change between requests, so build them once outside the loop.
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
for x in range(1, pages + 1):  # range() excludes its stop value, so add 1 to reach the last page
    page_url = "https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22teaching%22&sc.locationSeoString=new+york&locId=1132348&lo_IP{}.htm".format(x)
    page = requests.get(page_url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
Answer 0 (score: 1)
Look at the pagination links in the page source:
<li class="page">
<a href="/Job/jobs.htm?sc.generalKeyword=%22teaching%22&sc.locationSeoString=new+york&locId=1132348&locT=C&p=4">
<span class="link">4</span>
</a>
</li>
Logically, &p=n will go to page n.
So, to get page n:
url = f'https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword="teaching"&sc.locationSeoString=new+york&locId=1132348&locT=C&p={n}'
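The original site is driven by JavaScript: it just requests the data and then updates the URL and the page. So https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22teaching%22&sc.locationSeoString=new+york&locId=1132348&lo_IP4.htm is merely what it puts in the address bar, not an address that the server resolves to page 4.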
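As a minimal sketch of the p=n approach, the query string can be built with the standard library so that the quotes and the space are percent-encoded consistently. The helper name build_page_url and the BASE_URL constant are hypothetical, introduced only for illustration; the parameter values are the ones from the link in the question, and whether the site still honors p is an assumption based on the pagination markup quoted in this answer.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint and parameters are those shown in the question.
BASE_URL = "https://www.glassdoor.com/Job/jobs.htm"


def build_page_url(n):
    """Return the search URL for page n, selected via the p query parameter."""
    params = {
        "sc.generalKeyword": '"teaching"',   # encoded as %22teaching%22
        "sc.locationSeoString": "new york",  # encoded as new+york
        "locId": "1132348",
        "locT": "C",
        "p": n,                              # page number, as in the pagination links
    }
    return BASE_URL + "?" + urlencode(params)


# Build the URLs for pages 1 through 4.
urls = [build_page_url(n) for n in range(1, 5)]
```

Each URL in urls can then be passed to requests.get() together with the same User-Agent header used in the question.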