Python - Scraping a web page using HTML tags

Date: 2018-06-24 19:24:56

Tags: python-3.x web-scraping beautifulsoup urllib2

I am trying to scrape a web page to list out the jobs posted at the URL https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad

For the browser inspection of the page, see the image "Web inspect".

The following can be observed by inspecting the page:

  1. Each job listed is inside an HTML li with class="jobs-list-item". The li contains a parent div which carries the following HTML attributes and data:

    data-ph-at-job-title-text="Software Engineer II", data-ph-at-job-category-text="Engineering", data-ph-at-job-post-date-text="2018-03-19T16:33:00".

  2. The first child div with class="information" inside the parent div contains an href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II" with the job URL.

  3. The third child div with class="description au-target" inside the parent div holds the short job description.
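If those li elements were present in the server-rendered HTML, the attributes described above could be read with BeautifulSoup. Below is a minimal sketch using an inline HTML snippet reconstructed from the observations above (the live page builds these nodes with JavaScript, so a plain GET will not return them):

```python
from bs4 import BeautifulSoup

# Inline sample mimicking the structure described above; on the real site
# these nodes are rendered by JavaScript and absent from the raw HTML.
html = '''
<li class="jobs-list-item">
  <div data-ph-at-job-title-text="Software Engineer II"
       data-ph-at-job-category-text="Engineering"
       data-ph-at-job-post-date-text="2018-03-19T16:33:00">
    <div class="information">
      <a href="https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II">Job</a>
    </div>
    <div class="description au-target">Short job description</div>
  </div>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')
jobs = []
for li in soup.find_all('li', class_='jobs-list-item'):
    # the parent div is the one carrying the data-ph-at-* attributes
    parent = li.find('div', attrs={'data-ph-at-job-title-text': True})
    jobs.append({
        'title': parent['data-ph-at-job-title-text'],
        'category': parent['data-ph-at-job-category-text'],
        'posted': parent['data-ph-at-job-post-date-text'],
        'url': li.find('div', class_='information').a['href'],
        'description': li.find('div', class_='description').get_text(strip=True),
    })

print(jobs[0]['title'])  # Software Engineer II
```

This works on the sample snippet, but not on the live page, which is why the answer below switches to the site's JSON endpoint instead.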

My requirement is to extract the following information for each job:

  1. Job title
  2. Job category
  3. Job post date
  4. Job post time
  5. Job URL
  6. Job short description

I tried the following Python code to scrape the web page, but was unable to extract the required information:

import requests
from bs4 import BeautifulSoup

def ms_jobs():
    url = 'https://careers.microsoft.com/us/en/search-results?rk=l-hyderabad'
    resp = requests.get(url)

    if resp.status_code == 200:
        print("Successfully opened the web page")
        soup = BeautifulSoup(resp.text, 'html.parser')
        print(soup)
    else:
        print("Error")

ms_jobs()

1 Answer:

Answer 0 (score: 1)

If you want to do this with requests, you need to reverse-engineer the site. Open the dev tools in Chrome, select the "Network" tab and fill out the form.

That will show you how the site loads its data. If you dig around in the traffic, you will find that it fetches the data by POSTing to this endpoint: https://careers.microsoft.com/widgets. It also shows the payload the site uses. The site uses cookies, so all you have to do is create a session (which holds on to the cookies), GET the site once to pick a cookie up, and copy/paste the payload.

That way you can extract the JSON data that the javascript fetches to populate the site dynamically.

Below is a working example. What is left is just to parse the json as you see fit.

import requests
from pprint import pprint

# create a session to grab a cookie from the site
session = requests.Session()
r = session.get("https://careers.microsoft.com/us/en/")

# these params are the ones that the dev tools show that site sets when using the website form
payload = {
    "lang":"en_us",
    "deviceType":"desktop",
    "country":"us",
    "ddoKey":"refineSearch",
    "sortBy":"",
    "subsearch":"",
    "from":0,
    "jobs":"true",
    "counts":"true",
    "all_fields":["country","state","city","category","employmentType","requisitionRoleType","educationLevel"],
    "pageName":"search-results",
    "size":20,
    "keywords":"",
    "global":"true",
    "selected_fields":{"city":["Hyderabad"],"country":["India"]},
    "sort":"null",
    "locationData":{}
}

# this is the endpoint the site uses to fetch json
url = "https://careers.microsoft.com/widgets"
r = session.post(url, json=payload)
data = r.json()
job_list = data['refineSearch']['data']['jobs']

# the job_list will hold 20 jobs (you can set the 'size' parameter in the payload to a higher number if you please - I tested 100, and that returned 100 jobs)
job = job_list[0]
pprint(job)
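From there, the six fields the question asks for can be pulled out of each job dict. The key names used below ('title', 'category', 'postedDate', 'applyUrl', 'descriptionTeaser') are assumptions, not confirmed by the answer above; pprint one real job record and adjust them to whatever the payload actually contains. A minimal sketch, demonstrated on a hypothetical sample record:

```python
def extract_job_fields(job):
    """Map one job record from the widgets JSON onto the six requested fields.

    NOTE: the key names ('title', 'category', 'postedDate', 'applyUrl',
    'descriptionTeaser') are assumptions - verify them against a real
    response before relying on this.
    """
    posted = job.get('postedDate', '')  # e.g. '2018-03-19T16:33:00'
    date_part, _, time_part = posted.partition('T')  # split date from time
    return {
        'title': job.get('title'),
        'category': job.get('category'),
        'post_date': date_part,
        'post_time': time_part,
        'url': job.get('applyUrl'),
        'description': job.get('descriptionTeaser'),
    }

# Hypothetical sample record, for illustration only
sample = {
    'title': 'Software Engineer II',
    'category': 'Engineering',
    'postedDate': '2018-03-19T16:33:00',
    'applyUrl': 'https://careers.microsoft.com/us/en/job/406138/Software-Engineer-II',
    'descriptionTeaser': 'Short job description',
}
print(extract_job_fields(sample))
```

Applied over the whole job_list, this yields one flat dict per job with the title, category, post date, post time, URL and short description.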

Cheers.