Web scraping a page behind a login

Date: 2019-07-10 22:10:38

Tags: python beautifulsoup python-requests lxml

I'm trying to use Python to scrape a web page that sits behind a login page. I've tried adapting a few examples from Stack Overflow as a starting point, but none of them seem to work for me.

Attempt 1:

import requests
from lxml import html

USERNAME = "my username"
PASSWORD = "my password"
TOKEN = "my token"

LOGIN_URL = "https://example.com/admin/login"
URL = "https://example.com/admin/tickets"

session_requests = requests.session()

# Get login csrf token
result = session_requests.get(LOGIN_URL)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='_token']/@value")))[0]

# Create payload
payload = {
    "name": USERNAME,
    "password": PASSWORD,
    "_token": TOKEN
}

# Perform login
result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))

# Scrape url
result = session_requests.get(URL, headers = dict(referer = URL))
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//div[@class='a']/a/text()")

print(bucket_names)
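
One thing stands out in Attempt 1: authenticity_token is scraped from the login form but never used, because the payload sends the hardcoded TOKEN instead. A minimal sketch of the same login step, reusing the names defined above and assuming the form really expects a field named _token, would post the freshly scraped value and then check where the response landed:

# Re-fetch the login form and scrape the CSRF token tied to this session.
result = session_requests.get(LOGIN_URL)
tree = html.fromstring(result.text)
authenticity_token = tree.xpath("//input[@name='_token']/@value")[0]

payload = {
    "name": USERNAME,
    "password": PASSWORD,
    "_token": authenticity_token,  # scraped token, not a hardcoded one
}

result = session_requests.post(LOGIN_URL, data=payload, headers={"referer": LOGIN_URL})
# If result.url still points at the login page, the login likely failed.
print(result.status_code, result.url)

This is only a sketch: CSRF tokens are usually generated per session, so a value copied out of the browser (the hardcoded TOKEN) generally won't match the cookies that requests holds.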

Attempt 2:

import requests
from bs4 import BeautifulSoup

username = 'my username'
password = 'my password'
scrape_url = 'https://example.com/admin/tickets'

login_url = 'https://example.com/admin/login'
login_info = {'name': username,'password': password}

#Start session.
session = requests.session()

#Login using your authentication information.
session.post(url=login_url, data=login_info)

#Request page you want to scrape.
url = session.get(url=scrape_url)

soup = BeautifulSoup(url.content, 'html.parser')

for link in soup.findAll('a'):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
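
Attempt 2 never requests the login page before posting, so if the site requires the CSRF token that Attempt 1 hints at, the POST is probably rejected and the session never authenticates. A hedged sketch of the same login that first fetches the form and pulls out the hidden field with BeautifulSoup (again assuming the field is named _token, as in Attempt 1):

import requests
from bs4 import BeautifulSoup

username = 'my username'
password = 'my password'
login_url = 'https://example.com/admin/login'

session = requests.session()

# Fetch the login page first so the session picks up its cookies,
# then read the hidden CSRF input (assumed name: '_token').
login_page = session.get(login_url)
login_soup = BeautifulSoup(login_page.content, 'html.parser')
token_input = login_soup.find('input', {'name': '_token'})

login_info = {'name': username, 'password': password}
if token_input is not None:
    login_info['_token'] = token_input['value']

response = session.post(login_url, data=login_info)
print(response.url)  # still the login URL? then the login likely failed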

The first example prints:

[]

The second one gives me the links from the login page rather than from the main URL I actually want to scrape.
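
That symptom usually means the login POST did not authenticate and the site redirected the scrape request back to the login form. A quick diagnostic, reusing session and scrape_url from Attempt 2, is to look at where the request actually ended up:

page = session.get(url=scrape_url)
print(page.status_code)  # often 200 even on failure, after a redirect
print(page.url)          # ends in /admin/login? the session is unauthenticated
print(page.history)      # the redirect chain, if any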

I'm really not sure what the problem is; any pointers would be greatly appreciated.

Thanks,

Ryan

0 Answers:

No answers yet.