Removing duplicate links while scraping

Time: 2017-05-04 22:27:10

Tags: python web-scraping web-crawler

When I run the script I wrote in Python, I get a lot of duplicate results. Is there a way to get rid of these duplicates? Here is my script:

import requests
from lxml import html

def Startpoint():
    default="http://tennishub.co.uk"
    link="http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="countylist"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            page = default + link
            Midpoint(page)

def Midpoint(address):
    default="http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="pagination"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            mlink = default + link
            print(mlink)

Startpoint()

Here is a screenshot of what I get:

[screenshot of the script output; image not available]

2 answers:

Answer 0 (score: 3)

If the order doesn't matter, wrapping the links object in a set will remove the duplicates, since str instances are hashable:

links = title.xpath('.//a/@href')
links = set(links)
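
If the order does matter, a set is not the right tool, because it does not remember insertion order. A minimal sketch of an order-preserving alternative follows; the helper name dedupe_keep_order and the sample hrefs are made up for illustration and do not come from the original script:

```python
def dedupe_keep_order(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and, since Python 3.7, keep insertion order
    return list(dict.fromkeys(links))


# Hypothetical hrefs, standing in for title.xpath('.//a/@href')
sample_links = [
    "/tennis-clubs-by-county/Kent/2",
    "/tennis-clubs-by-county/Kent/3",
    "/tennis-clubs-by-county/Kent/2",
    "/tennis-clubs-by-county/Kent/4",
    "/tennis-clubs-by-county/Kent/3",
]

unique_links = dedupe_keep_order(sample_links)
for href in unique_links:
    print(href)
```

In the script this would replace the set call with links = dedupe_keep_order(links).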

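If you want the links to be unique across every title block on a page, and not only within a single block, keep just the links that have not been processed yet for each title, e.g.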

import requests
from lxml import html


def Startpoint():
    default = "http://tennishub.co.uk"
    link = "http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="countylist"]')
    processed_links = set()
    for title in titles:
        unprocessed_links = set(title.xpath('.//a/@href')) - processed_links
        for link in unprocessed_links:
            page = default + link
            Midpoint(page)
        processed_links |= unprocessed_links


def Midpoint(address):
    default = "http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="pagination"]')
    processed_links = set()
    for title in titles:
        unprocessed_links = set(title.xpath('.//a/@href')) - processed_links
        for link in unprocessed_links:
            mlink = default + link
            print(mlink)
        processed_links |= unprocessed_links


Startpoint()

Output (may differ from yours, because a set is unordered):

http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/3
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/10
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/2
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/4
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/7
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/5
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/4
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/4
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/3
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/2
http://tennishub.co.uk/tennis-clubs-by-county/East Sussex/3
http://tennishub.co.uk/tennis-clubs-by-county/East Sussex/2
http://tennishub.co.uk/tennis-clubs-by-county/Kent/8
http://tennishub.co.uk/tennis-clubs-by-county/Kent/3
http://tennishub.co.uk/tennis-clubs-by-county/Kent/4
http://tennishub.co.uk/tennis-clubs-by-county/Kent/2
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/3
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/4
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/2
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/14
http://tennishub.co.uk/tennis-clubs-by-county/Suffolk/2
http://tennishub.co.uk/tennis-clubs-by-county/Suffolk/3
http://tennishub.co.uk/tennis-clubs-by-county/Bedfordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/7
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Norfolk/2
http://tennishub.co.uk/tennis-clubs-by-county/Norfolk/3
http://tennishub.co.uk/tennis-clubs-by-county/Essex/4
http://tennishub.co.uk/tennis-clubs-by-county/Essex/2
http://tennishub.co.uk/tennis-clubs-by-county/Essex/7
http://tennishub.co.uk/tennis-clubs-by-county/Essex/3
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/7
http://tennishub.co.uk/tennis-clubs-by-county/Cumbria/2
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/4
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/9
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/3
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/2
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/6
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Staffordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Shropshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Worcestershire/3
http://tennishub.co.uk/tennis-clubs-by-county/Worcestershire/2
http://tennishub.co.uk/tennis-clubs-by-county/South Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/3
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/4
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/5
http://tennishub.co.uk/tennis-clubs-by-county/Northumberland/2
http://tennishub.co.uk/tennis-clubs-by-county/East Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Durham/2
http://tennishub.co.uk/tennis-clubs-by-county/North Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/North Yorkshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Devon/5
http://tennishub.co.uk/tennis-clubs-by-county/Devon/4
http://tennishub.co.uk/tennis-clubs-by-county/Devon/2
http://tennishub.co.uk/tennis-clubs-by-county/Devon/3
http://tennishub.co.uk/tennis-clubs-by-county/Wiltshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Wiltshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Dorset/2
http://tennishub.co.uk/tennis-clubs-by-county/Dorset/3
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/2
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/4
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/3
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/3
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/4
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/5
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/2
http://tennishub.co.uk/tennis-clubs-by-county/Cornwall/2
http://tennishub.co.uk/tennis-clubs-by-county/Nottinghamshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Nottinghamshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Lincolnshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Derbyshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Derbyshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/3
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/2
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/4
http://tennishub.co.uk/tennis-clubs-by-county/Northamptonshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Northamptonshire/2
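
The set arithmetic in the code above can be tried without any network access. The sketch below assumes, purely for illustration, that a page contains two pagination blocks listing overlapping hrefs; the function name new_links_per_block and the sample data are hypothetical:

```python
def new_links_per_block(blocks):
    """For each block of hrefs, return only the hrefs not seen in earlier blocks."""
    processed_links = set()
    new_per_block = []
    for hrefs in blocks:
        # set difference drops everything already handled in a previous block
        unprocessed_links = set(hrefs) - processed_links
        new_per_block.append(sorted(unprocessed_links))
        # in-place union remembers what has now been handled
        processed_links |= unprocessed_links
    return new_per_block, processed_links


# Hypothetical data: two pagination blocks that overlap
blocks = [
    ["/county/Kent/2", "/county/Kent/3", "/county/Kent/2"],
    ["/county/Kent/2", "/county/Kent/3", "/county/Kent/4"],
]

new_per_block, processed_links = new_links_per_block(blocks)
print(new_per_block)
```

The second block contributes only the one href that the first block did not already contain.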

Answer 1 (score: 0)

Here is another way to achieve this:

import requests
from lxml.html import fromstring

base = "http://tennishub.co.uk{}"
link = "http://tennishub.co.uk/"

unique_links = set()

def fetch_links(link):
    r = requests.get(link)
    tree = fromstring(r.text)
    for title_link in tree.xpath('//*[@class="countylist"]//a[@href]/@href'):
        yield base.format(title_link)

def fetch_all_next_page_links(link):
    r = requests.get(link)
    tree = fromstring(r.text)
    for item_link in tree.xpath('//*[@id="content"]/*[@class="pagination"]//a/@href'):
        qualified_link = base.format(item_link)
        if qualified_link not in unique_links:
            yield qualified_link
        unique_links.add(qualified_link)

if __name__ == '__main__':
    for item in fetch_links(link):
        for elem in fetch_all_next_page_links(item):
            print(elem)
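
The same "seen set" idea can be packaged without a module-level variable. This is only a sketch under assumptions: iter_unique is a hypothetical helper, and the link groups are hard-coded sample data rather than results fetched from the site:

```python
from itertools import chain


def iter_unique(link_groups):
    """Yield each link once, in first-seen order, across all groups."""
    seen = set()  # local to this call, so nothing leaks between runs
    for link in chain.from_iterable(link_groups):
        if link not in seen:
            seen.add(link)
            yield link


# Hypothetical groups, standing in for the links found on each page
link_groups = [
    ["http://tennishub.co.uk/a/2", "http://tennishub.co.uk/a/3"],
    ["http://tennishub.co.uk/a/3", "http://tennishub.co.uk/b/2"],
    ["http://tennishub.co.uk/a/2"],
]

unique_page_links = list(iter_unique(link_groups))
for elem in unique_page_links:
    print(elem)
```

Keeping the set inside the generator means two separate crawls do not share state, while the output order still follows the order in which the links were first encountered.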