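Below are some links I copied from the website I'm scraping. The problem is that in the sitemap there, some main categories appear more than once, such as "Fashion", "Audio Visual" and "Computers Servers". I only need each of these links once. How can I achieve this? I tried using a variable "counter" to check for the second occurrence, but that did not help either.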
<a href="http://www.example.com/networking-storage">Networking Storage</a>
<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
This is my Python code for getting these links:
import requests
from lxml import html

mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(mainPage.text)
for mainCat in mainTree.cssselect('a'):
    print(mainCat.get('href'))
It prints:
http://www.example.com/mobiles-tablets
http://www.example.com/fashion
http://www.example.com/fashion
http://www.example.com/printers-scanners
http://www.example.com/audio-visual
http://www.example.com/audio-visual
http://www.example.com/cameras
http://www.example.com/computers-servers
http://www.example.com/computers-servers
Whereas I need this:
http://www.example.com/mobiles-tablets
http://www.example.com/fashion
http://www.example.com/printers-scanners
http://www.example.com/audio-visual
http://www.example.com/cameras
http://www.example.com/computers-servers
Answer 0 (score: 1)
The code below works for me:
import requests
from lxml.cssselect import CSSSelector
from lxml import html
s='''<a href="http://www.example.com/mobiles-tablets">Mobiles Tablets</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/fashion">Fashion</a>
<a href="http://www.example.com/printers-scanners">Printers Scanners</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/audio-visual">Audio Visual</a>
<a href="http://www.example.com/cameras">Cameras</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>
<a href="http://www.example.com/computers-servers">Computers Servers</a>'''
# mainPage = requests.get("http://www.example.com/catalog/seo_sitemap/category/?p=1")
mainTree = html.fromstring(s)
# A set keeps each href once, but does not preserve the original order.
lnks = {i.get('href') for i in mainTree.cssselect('a')}
for i in lnks:
    print(i)
It prints (a set is unordered, so the order may differ between runs):
http://www.example.com/mobiles-tablets
http://www.example.com/printers-scanners
http://www.example.com/fashion
http://www.example.com/audio-visual
http://www.example.com/computers-servers
http://www.example.com/cameras
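Note that the output above is not in sitemap order, while the desired output in the question is. If the order matters, one possible variation is to deduplicate with a dict instead of a set. The sketch below assumes Python 3.7 or later, where dicts keep insertion order; the helper name unique_in_order is made up for this illustration, and the hard-coded list stands in for the hrefs pulled from the page.

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates removed, keeping first-occurrence order."""
    # dict keys are unique and (since Python 3.7) keep insertion order.
    # Anchors without an href yield None from .get('href'), so skip falsy values.
    return list(dict.fromkeys(h for h in hrefs if h))


# Stand-in data: the hrefs as they appear in the question's output.
links = [
    "http://www.example.com/mobiles-tablets",
    "http://www.example.com/fashion",
    "http://www.example.com/fashion",
    "http://www.example.com/printers-scanners",
    "http://www.example.com/audio-visual",
    "http://www.example.com/audio-visual",
    "http://www.example.com/cameras",
    "http://www.example.com/computers-servers",
    "http://www.example.com/computers-servers",
]

for link in unique_in_order(links):
    print(link)
```

With the tree from the answer, this could be called as unique_in_order(a.get('href') for a in mainTree.cssselect('a')), which should give the links once each in the order they appear on the page.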