When I run the script I wrote in Python, I get a lot of duplicate results. Is there any way to get rid of these duplicates? Here is my script:
import requests
from lxml import html

def Startpoint():
    default = "http://tennishub.co.uk"
    link = "http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="countylist"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            page = default + link
            Midpoint(page)

def Midpoint(address):
    default = "http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="pagination"]')
    for title in titles:
        links = title.xpath('.//a/@href')
        for link in links:
            mlink = default + link
            print(mlink)

Startpoint()
Here is a screenshot of what I get:
Answer 0 (score: 3)
If order does not matter, wrapping the links object in a set will remove the duplicates, since str instances are hashable:
links = title.xpath('.//a/@href')
links = set(links)
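To see the effect without hitting the site, here is a minimal offline sketch. The sample list is hypothetical and only stands in for what `title.xpath('.//a/@href')` might return. It also shows `dict.fromkeys`, which is not part of the answer above but is one way to drop duplicates while keeping first-seen order, in case order does matter.

```python
# Hypothetical sample standing in for title.xpath('.//a/@href')
links = [
    "/tennis-clubs-by-county/Kent/2",
    "/tennis-clubs-by-county/Kent/3",
    "/tennis-clubs-by-county/Kent/2",
    "/tennis-clubs-by-county/Kent/3",
]

# Duplicates are removed, but iteration order is not guaranteed
unique_unordered = set(links)

# Duplicates are removed and first-seen order is kept
# (dicts preserve insertion order in Python 3.7+)
unique_ordered = list(dict.fromkeys(links))

print(len(unique_unordered))
print(unique_ordered)
```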
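If you want the links to be unique across all of the pages, you need to skip, for each title, the links that have already been processed, for example: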
import requests
from lxml import html

def Startpoint():
    default = "http://tennishub.co.uk"
    link = "http://tennishub.co.uk/"
    response = requests.get(link)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="countylist"]')
    processed_links = set()
    for title in titles:
        # keep only the links that no earlier title has produced
        unprocessed_links = set(title.xpath('.//a/@href')) - processed_links
        for link in unprocessed_links:
            page = default + link
            Midpoint(page)
        processed_links |= unprocessed_links

def Midpoint(address):
    default = "http://tennishub.co.uk"
    response = requests.get(address)
    tree = html.fromstring(response.text)
    titles = tree.xpath('//div[@class="pagination"]')
    processed_links = set()
    for title in titles:
        # keep only the links that no earlier title has produced
        unprocessed_links = set(title.xpath('.//a/@href')) - processed_links
        for link in unprocessed_links:
            mlink = default + link
            print(mlink)
        processed_links |= unprocessed_links

Startpoint()
Output (may differ from yours, since a set is unordered):
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/3
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/10
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/2
http://tennishub.co.uk/tennis-clubs-by-county/Middlesex/4
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/7
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Hampshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Oxfordshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/5
http://tennishub.co.uk/tennis-clubs-by-county/Buckinghamshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Berkshire/4
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/4
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/3
http://tennishub.co.uk/tennis-clubs-by-county/West Sussex/2
http://tennishub.co.uk/tennis-clubs-by-county/East Sussex/3
http://tennishub.co.uk/tennis-clubs-by-county/East Sussex/2
http://tennishub.co.uk/tennis-clubs-by-county/Kent/8
http://tennishub.co.uk/tennis-clubs-by-county/Kent/3
http://tennishub.co.uk/tennis-clubs-by-county/Kent/4
http://tennishub.co.uk/tennis-clubs-by-county/Kent/2
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/3
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/4
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/2
http://tennishub.co.uk/tennis-clubs-by-county/Surrey/14
http://tennishub.co.uk/tennis-clubs-by-county/Suffolk/2
http://tennishub.co.uk/tennis-clubs-by-county/Suffolk/3
http://tennishub.co.uk/tennis-clubs-by-county/Bedfordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/7
http://tennishub.co.uk/tennis-clubs-by-county/Hertfordshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Cambridgeshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Norfolk/2
http://tennishub.co.uk/tennis-clubs-by-county/Norfolk/3
http://tennishub.co.uk/tennis-clubs-by-county/Essex/4
http://tennishub.co.uk/tennis-clubs-by-county/Essex/2
http://tennishub.co.uk/tennis-clubs-by-county/Essex/7
http://tennishub.co.uk/tennis-clubs-by-county/Essex/3
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Cheshire/7
http://tennishub.co.uk/tennis-clubs-by-county/Cumbria/2
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/4
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/9
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/3
http://tennishub.co.uk/tennis-clubs-by-county/Lancashire/2
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/6
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Warwickshire/4
http://tennishub.co.uk/tennis-clubs-by-county/Staffordshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Shropshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Worcestershire/3
http://tennishub.co.uk/tennis-clubs-by-county/Worcestershire/2
http://tennishub.co.uk/tennis-clubs-by-county/South Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/3
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/4
http://tennishub.co.uk/tennis-clubs-by-county/West Yorkshire/5
http://tennishub.co.uk/tennis-clubs-by-county/Northumberland/2
http://tennishub.co.uk/tennis-clubs-by-county/East Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Durham/2
http://tennishub.co.uk/tennis-clubs-by-county/North Yorkshire/2
http://tennishub.co.uk/tennis-clubs-by-county/North Yorkshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Devon/5
http://tennishub.co.uk/tennis-clubs-by-county/Devon/4
http://tennishub.co.uk/tennis-clubs-by-county/Devon/2
http://tennishub.co.uk/tennis-clubs-by-county/Devon/3
http://tennishub.co.uk/tennis-clubs-by-county/Wiltshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Wiltshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Dorset/2
http://tennishub.co.uk/tennis-clubs-by-county/Dorset/3
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/2
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/4
http://tennishub.co.uk/tennis-clubs-by-county/Somerset/3
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/3
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/4
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/5
http://tennishub.co.uk/tennis-clubs-by-county/Gloucestershire/2
http://tennishub.co.uk/tennis-clubs-by-county/Cornwall/2
http://tennishub.co.uk/tennis-clubs-by-county/Nottinghamshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Nottinghamshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Lincolnshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Derbyshire/2
http://tennishub.co.uk/tennis-clubs-by-county/Derbyshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/3
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/2
http://tennishub.co.uk/tennis-clubs-by-county/Leicestershire/4
http://tennishub.co.uk/tennis-clubs-by-county/Northamptonshire/3
http://tennishub.co.uk/tennis-clubs-by-county/Northamptonshire/2
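Since the order of the output above depends on set iteration, here is a rough sketch of one way to make it reproducible: apply the same set-difference idea, but sort each batch by its trailing page number before using it. The helper names `page_number` and `unique_sorted_links` are hypothetical, the sample lists are made up, and the sketch assumes every link ends in `/<number>`, as in the output shown.

```python
BASE = "http://tennishub.co.uk"

def page_number(link):
    """Return the trailing page number of a link such as '/.../Kent/10'."""
    return int(link.rsplit("/", 1)[1])

def unique_sorted_links(groups):
    """Collect each link once across all groups, sorted by page number per batch."""
    processed = set()
    ordered = []
    for group in groups:
        # same set difference as in the answer: drop links already handled
        unprocessed = set(group) - processed
        ordered.extend(BASE + link for link in sorted(unprocessed, key=page_number))
        processed |= unprocessed
    return ordered

# Hypothetical data: the same pagination links found in two blocks of one page
pagination_top = [
    "/tennis-clubs-by-county/Kent/2",
    "/tennis-clubs-by-county/Kent/3",
    "/tennis-clubs-by-county/Kent/10",
]
pagination_bottom = list(pagination_top)

for url in unique_sorted_links([pagination_top, pagination_bottom]):
    print(url)
```

Sorting with a numeric key places page 10 after page 3; a plain string sort would put it before page 2.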
Answer 1 (score: 0)
Here is another way to achieve this:
import requests
from lxml.html import fromstring

base = "http://tennishub.co.uk{}"
link = "http://tennishub.co.uk/"
unique_links = set()  # every pagination link yielded so far

def fetch_links(link):
    r = requests.get(link)
    tree = fromstring(r.text)
    for title_link in tree.xpath('//*[@class="countylist"]//a[@href]/@href'):
        yield base.format(title_link)

def fetch_all_next_page_links(link):
    r = requests.get(link)
    tree = fromstring(r.text)
    for item_link in tree.xpath('//*[@id="content"]/*[@class="pagination"]//a/@href'):
        qualified_link = base.format(item_link)
        # yield each link only the first time it is seen
        if qualified_link not in unique_links:
            yield qualified_link
            unique_links.add(qualified_link)

if __name__ == '__main__':
    for item in fetch_links(link):
        for elem in fetch_all_next_page_links(item):
            print(elem)
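The code above tracks seen links in a module-level set. As a possible variation, the bookkeeping can live inside a small generator instead, similar in spirit to the `unique_everseen` recipe in the itertools documentation. The helper name `unique` and the sample data below are hypothetical, and the sketch runs offline.

```python
def unique(iterable):
    """Yield each item the first time it appears, preserving order."""
    seen = set()  # local state, so every call starts fresh
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

# Hypothetical sample standing in for links scraped from a page
sample = [
    "http://tennishub.co.uk/tennis-clubs-by-county/Devon/2",
    "http://tennishub.co.uk/tennis-clubs-by-county/Devon/3",
    "http://tennishub.co.uk/tennis-clubs-by-county/Devon/2",
]

for elem in unique(sample):
    print(elem)
```

Because the set is created inside the generator, it could wrap any stream of links without relying on shared global state.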