Scraping a dynamic web page that requires authentication with Python requests, BeautifulSoup, and Selenium

Asked: 2017-07-26 17:04:08

Tags: javascript python-3.x selenium beautifulsoup phantomjs

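I am working on a web scraping project for a site that requires a login. I can log in successfully. The site contains a dynamic table. When I run my code, it scrapes the page but not the dynamic content. I tried Selenium, but it always asks me to log in in Chrome instead of taking me to the page.

Here is the login code for the page: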

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time

# A Session keeps cookies between requests, so the login persists.
server = requests.Session()

login_page_url = 'https://connect.data.com/login'
loginProcess_url = 'https://connect.data.com/loginProcess'

# Fetch the login form and read the CSRF token it embeds.
html = server.get(login_page_url).content
soup = BeautifulSoup(html, 'html.parser')
csrf = soup.find(id="CSRF_TOKEN")['value']

login_detail = {
    'j_username': '******',
    'j_password': '******',
    'CSRF_TOKEN': csrf,
}

# Submit the credentials; the session stores the resulting cookies.
server.post(loginProcess_url, data=login_detail)

# Note: everything after '#' is a URL fragment and is not sent to the server.
r = server.get('https://connect.data.com/search#p=searchresult;;t=companies;;ss=advancedsearch;;q=H4sIAAAAAAAAAE2PzQ6CQAyE36VnDgsKGq48gMar4UCWqptAa_YHYwjv7nYJ6mUyO52v2c5wM4NH66CeQXMgbw3GxxWOSiloMzDUB_dNc1UoGWSQF7vNqWpzufpOk5UFOD4HfuPKlzLci-xECpGjyEGkSoDFCSms_V8rQeV_NZGxzy-KBzzMc_2hANAuGXTaGyZ3ooaHMFI6UbIJGyYfXUocWw819Og0LJHSwVokf-7uCHVeZuDZd8MFNds-7lrzUi0flYbWRDoBAAA')
# Pass the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find('table', {"class": "result"}))
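One likely reason the request above returns the page without the table: the part of the search URL after `#` is a fragment, which HTTP clients such as requests keep on the client side and never send to the server. The table is then presumably filled in by JavaScript in the browser. The sketch below only illustrates the split; the shortened URL and the helper name `split_fragment` are made up for this example.

```python
from urllib.parse import urldefrag


def split_fragment(url):
    """Return (url_sent_to_server, fragment_kept_by_the_client)."""
    server_url, fragment = urldefrag(url)
    return server_url, fragment


# Shortened, hypothetical stand-in for the long search URL above.
example_url = 'https://connect.data.com/search#p=searchresult;;t=companies'

server_url, fragment = split_fragment(example_url)
print(server_url)  # the only part the server ever receives
print(fragment)    # interpreted by the page's JavaScript, not by the server
```

If that is what happens here, the HTML returned to requests is the same for every search, whatever the fragment contains.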

Here is the code I added for the dynamic content:

# Path to the local PhantomJS executable.
path_to_driver = '/Users/Moment/Desktop/phantomjs'

url = 'https://connect.data.com/search#p=searchresult;;t=companies;;ss=advancedsearch;;q=H4sIAAAAAAAAAE2PzQ6CQAyE36VnDgsKGq48gMar4UCWqptAa_YHYwjv7nYJ6mUyO52v2c5wM4NH66CeQXMgbw3GxxWOSiloMzDUB_dNc1UoGWSQF7vNqWpzufpOk5UFOD4HfuPKlzLci-xECpGjyEGkSoDFCSms_V8rQeV_NZGxzy-KBzzMc_2hANAuGXTaGyZ3ooaHMFI6UbIJGyYfXUocWw819Og0LJHSwVokf-7uCHVeZuDZd8MFNds-7lrzUi0flYbWRDoBAAA'

# Start a separate headless browser and load the search page in it.
browser = webdriver.PhantomJS(executable_path=path_to_driver)
browser.get(url)

# Parse the HTML as rendered by the browser.
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

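The first part of the code logs me in, but whenever I add the second part, I am no longer logged in. Instead, I get the login page.

I have tried both Chromedriver and PhantomJS.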
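A possible explanation for the behaviour described above: the requests session and the Selenium browser keep separate cookie stores, so the browser starts out logged out even though `server` is logged in. One commonly used workaround is to copy the session cookies into the browser before loading the protected page. The following is a minimal sketch and has not been tested against this site. The helper names `to_selenium_cookies` and `copy_login_to_browser` are hypothetical, and the cookie domain is deliberately left out on the assumption that the browser will assign the domain of the page it is currently on.

```python
def to_selenium_cookies(cookie_jar):
    """Convert cookie objects (e.g. from server.cookies) into dicts
    in the shape expected by Selenium's driver.add_cookie()."""
    converted = []
    for cookie in cookie_jar:
        converted.append({
            'name': cookie.name,
            'value': cookie.value,
            'path': cookie.path or '/',
            'secure': bool(cookie.secure),
            # Assumption: 'domain' is omitted so the browser uses the
            # domain of the page that is currently loaded.
        })
    return converted


def copy_login_to_browser(cookie_jar, browser, start_url):
    """Open start_url, then add every cookie from cookie_jar to the browser."""
    # Selenium only accepts cookies for the domain it is currently on,
    # so a page on the target site has to be loaded first.
    browser.get(start_url)
    for cookie in to_selenium_cookies(cookie_jar):
        browser.add_cookie(cookie)
    return browser
```

With the names from the code above, this would be called as `copy_login_to_browser(server.cookies, browser, login_page_url)`, followed by `browser.get(url)`. Whether that is enough depends on how the site ties a session to the client, which is not known here.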

0 answers:

No answers yet.