I'm trying to scrape a page that sits behind a login using Python. I've tried adapting a couple of working examples from Stack Overflow, but none of them seem to work for me.
Attempt 1:
import requests
from lxml import html
USERNAME = "my username"
PASSWORD = "my password"
TOKEN = "my token"
LOGIN_URL = "https://example.com/admin/login"
URL = "https://example.com/admin/tickets"
session_requests = requests.session()
# Get login csrf token
result = session_requests.get(LOGIN_URL)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='_token']/@value")))[0]
# Create payload
payload = {
    "name": USERNAME,
    "password": PASSWORD,
    "_token": TOKEN
}
# Perform login
result = session_requests.post(LOGIN_URL, data = payload, headers = dict(referer = LOGIN_URL))
# Scrape url
result = session_requests.get(URL, headers = dict(referer = URL))
tree = html.fromstring(result.content)
bucket_names = tree.xpath("//div[@class='a']/a/text()")
print(bucket_names)
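One detail worth noting in attempt 1: the code extracts `authenticity_token` from the login page but then sends the hard-coded `TOKEN` constant in the payload. Below is a minimal, self-contained sketch of scraping the token and using it in the payload; the sample HTML is an assumption standing in for the real login page (which I can't show here), on the guess that it contains a hidden `<input name="_token">` field:

```python
from lxml import html

# Sample HTML standing in for result.text from the login page
# (assumption: the real form has a hidden _token input).
sample_html = """
<form method="post" action="/admin/login">
  <input type="hidden" name="_token" value="abc123">
  <input type="text" name="name">
  <input type="password" name="password">
</form>
"""

tree = html.fromstring(sample_html)
# Take the first _token value found on the page.
authenticity_token = tree.xpath("//input[@name='_token']/@value")[0]

# Use the freshly scraped token rather than a hard-coded constant,
# since CSRF tokens are usually tied to the current session.
payload = {
    "name": "my username",
    "password": "my password",
    "_token": authenticity_token,
}
print(payload["_token"])  # → abc123
```

The key point is that the token posted back must be the one served with the page for this session; a stale token copied from an earlier visit will typically be rejected.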
Attempt 2:
import requests
from bs4 import BeautifulSoup
username = 'my username'
password = 'my password'
scrape_url = 'https://example.com/admin/tickets'
login_url = 'https://example.com/admin/login'
login_info = {'name': username,'password': password}
#Start session.
session = requests.session()
#Login using your authentication information.
session.post(url=login_url, data=login_info)
#Request page you want to scrape.
url = session.get(url=scrape_url)
soup = BeautifulSoup(url.content, 'html.parser')
for link in soup.findAll('a'):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
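Since attempt 2 returns the links from the login page, it may help to verify whether the login POST actually succeeded before scraping. A hedged sketch of one way to check, parsing sample HTML here in place of the real `session.post(...)` response (the form selector is an assumption about what the site's login page looks like):

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for session.post(login_url, data=login_info).text.
# If the server re-renders the login form, authentication likely failed
# (wrong field names, missing CSRF token, etc.).
response_text = """
<html><body>
  <form action="/admin/login" method="post">
    <input name="name"><input name="password">
  </form>
</body></html>
"""

soup = BeautifulSoup(response_text, "html.parser")
# Look for the login form in the response we got back.
login_form = soup.find("form", action="/admin/login")
if login_form is not None:
    print("still on login page - authentication probably failed")
else:
    print("login form gone - login may have succeeded")
```

In the real script, the same check on the POST response (or on `response.url`, which reflects redirects) would show whether the session is actually authenticated before requesting `scrape_url`.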
The first attempt prints:
[]
The second one gives me the links from the login page, not the links from the main URL I'm trying to scrape.
I'm really not sure what the problem is; any pointers would be much appreciated.
Thanks,
Ryan