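I am working on a web scraping project for a site that requires login. I can log in successfully. The site contains a dynamic table. When I run my code, it scrapes the page but not the dynamic content. I tried Selenium, but it always asks me to log in through Chrome instead of taking me to the page.
Here is the login code for the page: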
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
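# Use one Session so the login cookies persist across requests.
server = requests.Session()
login_page_url = 'https://connect.data.com/login'
loginProcess_url = 'https://connect.data.com/loginProcess'
# Fetch the login page and read the CSRF token from the form.
html = server.get(login_page_url).content
soup = BeautifulSoup(html, 'html.parser')
csrf = soup.find(id="CSRF_TOKEN")['value']
# Credentials are masked here.
login_detail = {
    'j_username': '******',
    'j_password': '******',
    'CSRF_TOKEN': csrf,
}
# Submit the login form; the session keeps the resulting cookies.
server.post(loginProcess_url, data=login_detail)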
r = server.get('https://connect.data.com/search#p=searchresult;;t=companies;;ss=advancedsearch;;q=H4sIAAAAAAAAAE2PzQ6CQAyE36VnDgsKGq48gMar4UCWqptAa_YHYwjv7nYJ6mUyO52v2c5wM4NH66CeQXMgbw3GxxWOSiloMzDUB_dNc1UoGWSQF7vNqWpzufpOk5UFOD4HfuPKlzLci-xECpGjyEGkSoDFCSms_V8rQeV_NZGxzy-KBzzMc_2hANAuGXTaGyZ3ooaHMFI6UbIJGyYfXUocWw819Og0LJHSwVokf-7uCHVeZuDZd8MFNds-7lrzUi0flYbWRDoBAAA')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.find('table', {"class": "result"}))
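One possible reason the request above returns the page without the table: everything after the `#` in the search URL is a fragment, which an HTTP client does not send to the server, so the search parameters would only ever be read by JavaScript running in a browser. The sketch below illustrates this with the standard library; the URL is a shortened, hypothetical stand-in for the real search URL.

```python
from urllib.parse import urlsplit

# Shortened, hypothetical stand-in for the search URL used above.
url = "https://connect.data.com/search#p=searchresult;;t=companies"

parts = urlsplit(url)

# The server only sees the path (and query, if any).
print("path sent to server:", parts.path)
print("query sent to server:", repr(parts.query))

# The fragment stays on the client and is interpreted by the page's JavaScript.
print("fragment kept client-side:", parts.fragment)
```

If that is what is happening here, the table can only be obtained by something that executes the page's JavaScript, or by calling whatever endpoint the page itself requests for the data.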
Here is the code I added to get the dynamic content:
path_to_driver = '/Users/Moment/Desktop/phantomjs'
url = 'https://connect.data.com/search#p=searchresult;;t=companies;;ss=advancedsearch;;q=H4sIAAAAAAAAAE2PzQ6CQAyE36VnDgsKGq48gMar4UCWqptAa_YHYwjv7nYJ6mUyO52v2c5wM4NH66CeQXMgbw3GxxWOSiloMzDUB_dNc1UoGWSQF7vNqWpzufpOk5UFOD4HfuPKlzLci-xECpGjyEGkSoDFCSms_V8rQeV_NZGxzy-KBzzMc_2hANAuGXTaGyZ3ooaHMFI6UbIJGyYfXUocWw819Og0LJHSwVokf-7uCHVeZuDZd8MFNds-7lrzUi0flYbWRDoBAAA'
browser = webdriver.PhantomJS(executable_path=path_to_driver)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())
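The first part of the code logs me in, but whenever I add the second part, I am no longer logged in. Instead, I get the login page.
I have tried both Chromedriver and PhantomJS.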
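A likely explanation is that the browser started by Selenium is a separate client from the `requests` session, so it does not share the login cookies. One approach that might work is to copy the session's cookies into the browser before opening the search page. The following is a minimal sketch under that assumption; `to_selenium_cookies` and `copy_login_to_browser` are hypothetical helper names, and whether the site accepts cookies transferred this way is not confirmed.

```python
def to_selenium_cookies(cookie_jar):
    """Convert cookies from a requests session into dicts for add_cookie()."""
    return [
        {
            "name": c.name,
            "value": c.value,
            "path": c.path or "/",
            "secure": bool(c.secure),
        }
        for c in cookie_jar
    ]


def copy_login_to_browser(session, browser, site_url):
    """Open site_url first, then add the session's cookies to the browser."""
    # Selenium only accepts cookies for the domain that is currently loaded,
    # so a page on the target site has to be opened before adding them.
    browser.get(site_url)
    for cookie in to_selenium_cookies(session.cookies):
        browser.add_cookie(cookie)
```

With the names from the code above, this would be called as `copy_login_to_browser(server, browser, login_page_url)` followed by `browser.get(url)`. The `domain` key is left out on purpose so each cookie is attached to the host the browser is currently on; that is an assumption that may need adjusting if the site sets cookies for a parent domain.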