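(Sorry for my English, I'll do my best):
I'm new to Python and I'm looking for help with web scraping. I already have working code that gets the links I want, but the website is password-protected. With the help of many questions I read, I managed to get working code that scrapes the website after logging in, but the links I want are on another page:
The login page is http://fantasy.trashtalk.co/login.php
The landing page (the page I scrape with this code) is http://fantasy.trashtalk.co/
The page I want is http://fantasy.trashtalk.co/?tpl=classement&t=1
So I have this code (some imports may be useless; they come from another script):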
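As far as I understand, this code only lets me reach the login page and then scrape what comes next (the landing page), but I don't know how to "save" my login information to access the page I want to scrape.
I think I should add something like this after the login code, but when I do, it only scrapes the links from the login page: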
from bs4 import BeautifulSoup
import requests
from lxml import html
import urllib.request
import re

username = 'myusername'
password = 'mypass'
url = "http://fantasy.trashtalk.co/?tpl=classement&t=1"
log = "http://fantasy.trashtalk.co/login.php"
values = {'email': username,
          'password': password}
r = requests.post(log, data=values)
# Not sure about the code below but it works.
data = r.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
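Also, I've read some threads here that use "with session" or something like that, but I didn't manage to make it work.
Any help would be appreciated. Thank you for your time.
Answer 0 (score: 2)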
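The problem is that you need to post your login credentials through a session object rather than through a plain requests call. A session keeps the cookies the server sets at login and sends them with every later request, whereas a standalone requests.post discards them. I've modified the code below; you can now access the HTML of the page at scrape_url. Good luck!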
import requests
from bs4 import BeautifulSoup

username = 'email'
password = 'password'
scrape_url = 'http://fantasy.trashtalk.co/?tpl=classement&t=1'
login_url = 'http://fantasy.trashtalk.co/login.php'
login_info = {'email': username, 'password': password}

# Start a session so cookies set at login are reused on later requests.
session = requests.Session()
# Log in using your authentication information.
session.post(url=login_url, data=login_info)
# Request the page you want to scrape.
response = session.get(url=scrape_url)
soup = BeautifulSoup(response.content, 'html.parser')
# Only match anchors that have an href, so link['href'] cannot raise KeyError.
for link in soup.find_all('a', href=True):
    print('\nLink href: ' + link['href'])
    print('Link text: ' + link.text)
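One thing the code above does not do is check whether the login actually worked, so a wrong password or field name would silently scrape the login page again. Below is a minimal sketch of such a check. The helper name fetch_after_login is made up for illustration, and the "no cookies means the login failed" test is only an assumption about how this site behaves; some sites set cookies even on a failed login.

```python
def fetch_after_login(session, login_url, scrape_url, credentials):
    """Log in through `session`, then return the HTML of `scrape_url`.

    `session` is expected to behave like requests.Session: it needs
    post(), get() and a `cookies` attribute.
    """
    login_response = session.post(login_url, data=credentials)
    # Raise on HTTP errors (4xx/5xx) instead of scraping an error page.
    login_response.raise_for_status()

    # Assumption: a successful login sets at least one cookie.
    # This is a rough heuristic, not a guarantee for every site.
    if len(session.cookies) == 0:
        raise RuntimeError("login set no cookies; check the form field names")

    # The same session sends the login cookies along with this request.
    page = session.get(scrape_url)
    page.raise_for_status()
    return page.text
```

With the variables from the answer, it could be called as fetch_after_login(requests.Session(), login_url, scrape_url, login_info), and the returned string passed to BeautifulSoup as before.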