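I've written a script in Python to collect all the links leading to the next pages. However, it only works up to a point. The highest next-page link is 255. When I run my script, I get the first 23 links plus the link to the last page, but everything in between [24 to 254] is missing. How can I get all of them? Here is what I'm trying: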
import requests
from lxml import html

page_link = "https://www.yify-torrent.org/search/1080p/"
b_link = "https://www.yify-torrent.org"

def get_links(main_link):
    links = []
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('div.pager a'):
        if item.attrib["href"] not in links:
            links.append(item.attrib["href"])
    for link in links:
        print(b_link + link)

get_links(page_link)
The elements containing the next-page links are:
<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>
The results I get look like this [trimmed to the last five links]:
https://www.yify-torrent.org/search/1080p/t-20/
https://www.yify-torrent.org/search/1080p/t-21/
https://www.yify-torrent.org/search/1080p/t-22/
https://www.yify-torrent.org/search/1080p/t-23/
https://www.yify-torrent.org/search/1080p/t-255/
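Answer 0 (score: 1)
The answer provided by @kaze should clearly return all 255 pages, but if you need to get all the links dynamically without hard-coding the total number of pages, you can try: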
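import requests
from lxml import html

r = requests.get("https://www.yify-torrent.org/search/1080p/")
tree = html.fromstring(r.content)
# The "Last" link's href ends in "t-<N>/"; pull out N as the total page count.
page_number = tree.xpath("//div[@class='pager']/a[.='Last']/@href")[0].split("/")[-2].replace("t-", "")
# Start at 1: range(N + 1) would also request t-0, which the pager never links to.
for page in range(1, int(page_number) + 1):
    requests.get("https://www.yify-torrent.org/search/1080p/t-%s/" % page)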
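The page-count extraction can also be tried offline against the pager markup quoted in the question. The following is only a minimal sketch using the standard library instead of lxml; `PagerParser` and `last_page_number` are hypothetical names introduced here, and it assumes the "Last" link always ends in a `t-N/` suffix as in the sample.

```python
import re
from html.parser import HTMLParser


class PagerParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []    # list of (href, link text) tuples
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def last_page_number(pager_html):
    """Return the page number encoded in the href of the 'Last' link."""
    parser = PagerParser()
    parser.feed(pager_html)
    for href, text in parser.links:
        if text == "Last":
            match = re.search(r"/t-(\d+)/?$", href)
            if match:
                return int(match.group(1))
    raise ValueError("no 'Last' link with a t-N suffix found")


# Shortened copy of the pager markup from the question.
sample = (
    '<div class="pager"><a href="/search/1080p/" class="current">1</a> '
    '<a href="/search/1080p/t-2/">2</a> '
    '<a href="/search/1080p/t-2/">Next</a> '
    '<a href="/search/1080p/t-255/">Last</a> </div>'
)
print(last_page_number(sample))
```

Raising `ValueError` when no "Last" link is found is a design choice for this sketch, so that a change in the site's markup fails loudly rather than silently yielding zero pages.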
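Answer 1 (score: 0)
If the link structure were irregular, you would have to "walk the site", but here you can simply generate the links yourself, like this: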
for i in range(1, 256):
    print('https://www.yify-torrent.org/search/1080p/t-%s/' % i)
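One detail worth noting from the pager markup in the question: page 1 is linked as `/search/1080p/` with no `t-1/` suffix. Whether `t-1/` also resolves is not shown in the thread, so a cautious sketch might mirror the pager exactly. The helper names `page_url` and `all_page_urls` below are hypothetical.

```python
from urllib.parse import urljoin

BASE = "https://www.yify-torrent.org"
SEARCH_PATH = "/search/1080p/"


def page_url(page, base=BASE, search_path=SEARCH_PATH):
    """Build the URL for a 1-based page number, mirroring the pager's hrefs."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # The pager links page 1 without a suffix and page N (N > 1) as t-N/.
    path = search_path if page == 1 else "%st-%d/" % (search_path, page)
    return urljoin(base, path)


def all_page_urls(last_page):
    """Return the URLs for pages 1 through last_page inclusive."""
    return [page_url(n) for n in range(1, last_page + 1)]


for url in all_page_urls(3):
    print(url)
```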
Answer 2 (score: -1)
Your script looks correct. Looking at the page's HTML, I see:
<a href="/search/1080p/t-2/">2</a>
<a href="/search/1080p/t-3/">3</a>
<a href="/search/1080p/t-4/">4</a>
<a href="/search/1080p/t-5/">5</a>
<a href="/search/1080p/t-6/">6</a>
<a href="/search/1080p/t-7/">7</a>
<a href="/search/1080p/t-8/">8</a>
<a href="/search/1080p/t-9/">9</a>
<a href="/search/1080p/t-10/">10</a>
<a href="/search/1080p/t-11/">11</a>
<a href="/search/1080p/t-12/">12</a>
<a href="/search/1080p/t-13/">13</a>
<a href="/search/1080p/t-14/">14</a>
<a href="/search/1080p/t-15/">15</a>
<a href="/search/1080p/t-16/">16</a>
<a href="/search/1080p/t-17/">17</a>
<a href="/search/1080p/t-18/">18</a>
<a href="/search/1080p/t-19/">19</a>
<a href="/search/1080p/t-20/">20</a>
<a href="/search/1080p/t-21/">21</a>
<a href="/search/1080p/t-22/">22</a>
<a href="/search/1080p/t-23/">23</a>
<a href="/search/1080p/t-2/">Next</a>
<a href="/search/1080p/t-255/">Last</a>
It seems that t-2 is the pointer to the Next page, which will contain the remaining links.
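Following the Next pointer from page to page could be sketched roughly as below. This is an illustration under assumptions, not a tested crawler for the site: `walk_pages` and `fetch` are hypothetical names, `fetch` is any callable that returns a page's HTML as a string (for example `lambda u: requests.get(u).text`), and the loop assumes the final page either has no Next link or points back to a page already visited.

```python
import re
from urllib.parse import urljoin

# Matches the pager's <a href="...">Next</a> link as it appears in the question.
NEXT_LINK = re.compile(r'<a\s+href="([^"]+)"[^>]*>\s*Next\s*</a>')


def walk_pages(start_url, fetch, max_pages=1000):
    """Yield each page URL reached by following 'Next' links from start_url."""
    seen = set()
    url = start_url
    # Stop on a missing Next link, a repeated URL, or the safety cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        match = NEXT_LINK.search(fetch(url))
        url = urljoin(url, match.group(1)) if match else None


# Stand-in for the real site so the sketch runs without network access.
fake_site = {
    "https://example.test/search/1080p/": '<a href="/search/1080p/t-2/">Next</a>',
    "https://example.test/search/1080p/t-2/": '<a href="/search/1080p/t-3/">Next</a>',
    "https://example.test/search/1080p/t-3/": "",
}
for page in walk_pages("https://example.test/search/1080p/", fake_site.__getitem__):
    print(page)
```

A regular expression keeps the sketch dependency-free but is fragile against markup changes; with lxml, an XPath such as `//div[@class='pager']/a[.='Next']/@href` would be the more robust way to find the same link.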