How can I prevent duplicate links from being parsed?

Asked: 2017-07-26 09:04:55

Tags: python python-3.x web-scraping css-selectors web-crawler

I've written a script in Python to scrape the next-page links available on that webpage. The only problem with this scraper is that it can't get rid of duplicate links. I hope someone can help me accomplish that. Here is what I've tried:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            print(item.attrib["href"])

nextpage_links(page_link)

Here is a partial screenshot of the output I get: [screenshot]

1 Answer:

Answer 0 (score: 1)

You can use a set for this purpose:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links(main_link):
    # A set stores each href only once, so duplicates are discarded.
    links = set()
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            links.add(item.attrib["href"])

    return links

print(nextpage_links(page_link))
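
Note that a set does not preserve the order in which the links appear on the page. If order matters, a small variant of the same requests/lxml code (the function name nextpage_links_ordered below is just illustrative, not from the original answer) can track seen links in a set while collecting them in a list:

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"

def nextpage_links_ordered(main_link):
    # Keep only the first occurrence of each href, in page order.
    seen = set()
    ordered = []
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        href = item.attrib["href"]
        if "page" in href and href not in seen:
            seen.add(href)
            ordered.append(href)
    return ordered

print(nextpage_links_ordered(page_link))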

You can also use Scrapy, which filters out duplicate requests by default.
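
A minimal sketch of that approach, assuming a recent Scrapy release (the spider class and name below are placeholders, not part of the original answer):

import scrapy

class NextPageSpider(scrapy.Spider):
    # Hypothetical spider; Scrapy's scheduler de-duplicates requests by
    # default (RFPDupeFilter), so each URL is only fetched once.
    name = "nextpage_links"
    start_urls = ["https://yts.ag/browse-movies"]

    def parse(self, response):
        for href in response.css('ul.tsc_pagination a::attr(href)').getall():
            if "page" in href:
                # Even if the same pagination href appears several times,
                # the duplicate filter drops the repeated requests.
                yield response.follow(href, callback=self.parse)

You can run it with scrapy runspider spider.py (any filename works); the duplicate filtering needs no extra configuration.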