Web scraping with Python: login required to view the output

Time: 2019-07-16 08:30:58

Tags: python-3.x web-scraping beautifulsoup python-requests mechanize

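I am trying to output the salary for a job, but the page says I need to log in to view it. I can successfully output the other job details, such as the job title, company, and location. I tried logging in with my account and logging out again, but it still shows "Login to view salary". My question is: how can I display a salary that requires a login to view? Any help would be appreciated.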

import requests
from bs4 import BeautifulSoup
from mechanize import Browser
import http.cookiejar as cookielib

#creates browser
br = Browser()
#browser options
br.set_handle_robots(False)  #ignore robots
br.set_handle_refresh(False) #can sometimes hang without this
br.addheaders = [('User-Agent', 'Firefox')]
login_url = "https://myjobstreet.jobstreet.com.my/home/login.php"
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
response = br.open('https://myjobstreet.jobstreet.com.my/home/login.php')
#view available forms
for f in br.forms():
    print(f)
br.select_form('login')
br.set_all_readonly(False)   #allows everything to be written to
br.form['login_id'] = 'my_id'
br.form['password'] = 'my_password'
#submit current form
br.submit()

r = requests.get(url, headers=headers, auth=('user', 'pass'))
soup = BeautifulSoup(r.text, 'lxml')
jobs = soup.find_all("div", {"class": "rRow"})
for job in jobs:
    try:
        salary = job.find_all("div", {"class": "rRowLoc"})
        job_salary = salary[0].text.strip()
    except IndexError:
        pass

    print("Salary: ", job_salary)

This is the output:

Job:  Sales Executive
Company:  Company
Location:  Earth
Salary:  Login to view salary

Expected output:

Job:  Sales Executive
Company:  Company
Location:  Earth
Salary:  1000 

2 answers:

Answer 0 (score: 1)

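Your code does not work as posted, but your goal is to scrape the company name, position, location, and salary from the page.

You can use requests to log in.

The salary details are not available in the page HTML because they are delivered by a separate Ajax request, so whenever you look for the salary in the HTML it comes back empty.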

import requests
import bs4 as bs

headers = {
    'Host': 'myjobstreet.jobstreet.com.my',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31',
}

login_url = 'https://myjobstreet.jobstreet.com.my/home/login.php?site=&language_code=3'
post_data_for_login = {
    "referer_url":"",
    "mobile_referer":"",
    "login_id":"**YOUR EMAIL ID**",
    "password":"**YOUR PASSWORD**",
    "remember":"on",
    "btn_login":"",
    "login":"1"
}

# Create Session.
session = requests.session()

# Login request to get cookies.
response = session.post(login_url, data=post_data_for_login, headers=headers)

print('login_response:', response.status_code)

job_page_url = 'https://www.jobstreet.com.my/en/job/fb-service-team-4126557'
job_page_json_url = job_page_url + '/panels'

# Update Host in headers.
headers['Host'] = 'www.jobstreet.com.my'

# Get Job details.
response = session.get(job_page_url, headers=headers)

# Fetch Company Name, Position and Location details from HTML.
soup = bs.BeautifulSoup(response.text, 'lxml')
company_name = soup.find("div", {"id": "company_name"}).text.strip()
position_title = soup.find("h1", {"id": "position_title"}).text.strip()
work_location = soup.find("span", {"id": "single_work_location"}).text.strip()
print('Company:', company_name)
print('Position:', position_title)
print('Location:', work_location)

# Get Salary data From JSON.
response = session.get(job_page_json_url, headers=headers)

# Fetch Salary details from JSON.
if response.status_code == 200:
    json_data = response.json()
    salary_tag = json_data['job_salary']

    soup = bs.BeautifulSoup(salary_tag, 'lxml')
    salary_range = soup.find("span", {"id": "salary_range"}).text
    print('Salary:', salary_range)

Output:

login_response: 200
Company: Copper Bar and Restaurant (88 Armenian Sdn Bhd)
Position: F&B Service Team
Location: Malaysia - Penang
Salary:  MYR 2,000 - MYR 2,500
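
The last step above indexes json_data['job_salary'] and calls .text on the result of soup.find, so it would raise if the key or the span is missing, for example when the login did not take effect. As a minimal sketch of a more defensive version, the helper below returns None in those cases. It assumes the payload has the shape the answer describes (a "job_salary" key holding an HTML fragment with a span whose id is "salary_range"). It swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and the names SpanTextById, extract_salary, and the sample payload are made up for illustration.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Collects the text inside the first <span> carrying the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False   # True once the target span has been seen
        self.depth = 0       # > 0 while we are inside the target span
        self.chunks = []     # text pieces collected inside the span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1  # a nested span inside the target
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_salary(panel, key="job_salary", span_id="salary_range"):
    """Return the salary text, or None if the panel does not carry it."""
    fragment = panel.get(key)
    if not fragment:
        return None
    parser = SpanTextById(span_id)
    parser.feed(fragment)
    parser.close()
    if not parser.found:
        return None
    return "".join(parser.chunks).strip()


# Made-up payload shaped like the JSON response described in the answer.
sample = {
    "job_salary": '<div><span id="salary_range"> MYR 2,000 - MYR 2,500</span></div>'
}
print("Salary:", extract_salary(sample))
print("Salary:", extract_salary({}))  # missing key: prints None instead of raising
```

With the real response, the dict returned by response.json() would take the place of sample.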

Answer 1 (score: 0)

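The code is not runnable, and I can see multiple problems. You never use login_url, and the variables url and headers are undefined. You instantiate a browser br and use br.open to log in, but then you stop using that browser. You should keep using the browser instead of requests.get. Your goal should be to obtain the cookies after logging in and keep sending them on the following pages. I'm not familiar with mechanize, but this would be how you get the HTML from open: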

response = br.open(url)
print(response.read())      # the text of the page
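
To illustrate why the same client has to be reused, here is a small self-contained sketch. It does not talk to the real site: DemoSite is a hypothetical local server that hands out a session cookie at /login and only reveals the salary when that cookie comes back. It also uses the standard library's urllib and http.cookiejar rather than mechanize, on the assumption that the cookie handling principle is the same.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that hides salary from anonymous users."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            # Pretend the login succeeded and hand out a session cookie.
            body = b"logged in"
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        else:
            logged_in = "session=abc123" in self.headers.get("Cookie", "")
            body = b"1000" if logged_in else b"Login to view salary"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(opener, url):
    """Open the URL with the given opener and return the body as text."""
    with opener.open(url) as response:
        return response.read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # ignore any proxy settings
    try:
        # One opener with a cookie jar, used for both the login and the job page.
        with_cookies = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(CookieJar()))
        fetch(with_cookies, base + "/login")
        kept = fetch(with_cookies, base + "/job")

        # A fresh client that never saw the login cookie.
        without_cookies = urllib.request.build_opener(no_proxy)
        lost = fetch(without_cookies, base + "/job")
        return kept, lost
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    kept, lost = run_demo()
    print("Same client after login:", kept)
    print("Fresh client:", lost)
```

The second client plays the role of the separate requests.get call in the question: it never received the login cookie, so the demo server treats it as anonymous.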

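A better option might be to open the developer tools, look at the network requests, right-click the request, and click "Copy as cURL". This shows you how to repeat the request from the command line, cookies and all. See https://developers.google.com/web/updates/2015/05/replay-a-network-request-in-curl for a better explanation, including a gif.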
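
If you would rather replay the request from Python than from the command line, one possible approach is to take the Cookie header value from the copied command and turn it into a dict. This is only a sketch: cookies_from_header is a made-up helper name and the cookie string is invented, since the real names and values depend on the site and your session.

```python
from http.cookies import SimpleCookie


def cookies_from_header(header_value):
    """Turn a 'name=value; other=value' Cookie header into a plain dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Made-up cookie string; a real one comes from the copied cURL command.
copied = "session_id=abc123; lang=en"
cookies = cookies_from_header(copied)
print(cookies)
```

The resulting dict can be passed to requests through the cookies argument, for example requests.get(url, cookies=cookies), where url is the page you want to fetch.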