Question

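I am trying to collect the "a" tags inside the class="featured" divs from the website http://www.pakistanfashionmagazine.com. I wrote the code below; it runs without errors, but it repeats the links. How can I get rid of the duplicates?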

from bs4 import BeautifulSoup
import requests

url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

results = soup.findAll('div', attrs={"class": 'featured'})
for div in results:
    links = div.findAll('a')
    for a in links:
        print "http://www.pakistanfashionmagazine.com/" + a['href']

Answer 1

In the actual HTML page, each item <div> contains two links: one around the image and another inside the <h4> tag:

<div class="item">
  <div class="image">
    <a href="/dress/casual-dresses/bella-embroidered-lawn-collection-3-stitched-suits-pkr-14000-only.html" title="BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY"><img src="/siteimages/upload/BELLA-Embroidered-Lawn-Collection3-STITCHED-SUITSPKR-14000-ONLY_1529IM1-thumb.jpg" alt="Featured Product" /></a>
  </div>
  <div class="detail">
    <h4><a href="/dress/casual-dresses/bella-embroidered-lawn-collection-3-stitched-suits-pkr-14000-only.html">BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY</a></h4>
    <em>updated: 2013-06-03</em>
    <p>BELLA Embroidered Lawn Collection*3 STITCHED SUITS@PKR 14000 ONLY</p>
  </div>
</div>

Limit your links to just one or the other; I am using CSS selectors here:

links = soup.select('div.featured .detail a[href]')
for link in links:
    print "http://www.pakistanfashionmagazine.com/" + link['href']

Now 32 links are printed instead of 64.
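If narrowing the selector is not an option, another way to get the same effect is to drop repeated href values after collecting them. The following is only a minimal sketch in Python 3 syntax, not part of the original answer: the helper name unique_links and the sample hrefs are hypothetical, and it assumes the hrefs have already been extracted into a list. It also uses urljoin from the standard library to build the absolute URL, which avoids the doubled slash that plain string concatenation produces when the href already starts with "/".

```python
from urllib.parse import urljoin

BASE_URL = "http://www.pakistanfashionmagazine.com/"


def unique_links(hrefs, base=BASE_URL):
    """Return absolute URLs for hrefs, dropping repeats but keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so it acts as an ordered set.
    return [urljoin(base, href) for href in dict.fromkeys(hrefs)]


# Hypothetical sample: each item contributes the same href twice
# (once from the image link, once from the <h4> link).
sample_hrefs = [
    "/dress/item-one.html",
    "/dress/item-one.html",
    "/beauty/item-two.html",
    "/beauty/item-two.html",
]

for url in unique_links(sample_hrefs):
    print(url)
```

This treats two links as duplicates only when their href strings are identical, so it would not merge links that differ by a query string or a trailing slash.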

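If you need to limit this to just the second featured section (Beauty Tips), you can do that too: select the featured divs, pick the second one from the list, then select the links inside it: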

links = soup.select('div.featured')[1].select('.detail a[href]')

Now you have only the 8 links from that section.
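To see how the selectors above behave without fetching the live site, here is a small self-contained sketch in Python 3 syntax. The markup in SAMPLE_HTML is a hypothetical, heavily shortened stand-in that only mimics the nesting shown earlier (two featured sections with one item each), so the counts differ from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with the same nesting as the real page.
SAMPLE_HTML = """
<div class="featured">
  <div class="item">
    <div class="image"><a href="/dress/a.html"><img src="/a.jpg" alt="" /></a></div>
    <div class="detail"><h4><a href="/dress/a.html">A</a></h4></div>
  </div>
</div>
<div class="featured">
  <div class="item">
    <div class="image"><a href="/beauty/b.html"><img src="/b.jpg" alt="" /></a></div>
    <div class="detail"><h4><a href="/beauty/b.html">B</a></h4></div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every anchor inside a featured div: each item shows up twice.
all_anchors = [a["href"] for a in soup.select("div.featured a[href]")]

# Only the anchors under .detail: one link per item.
detail_only = [a["href"] for a in soup.select("div.featured .detail a[href]")]

# Only the second featured section (index 1 in the list of matches).
second_section = [
    a["href"] for a in soup.select("div.featured")[1].select(".detail a[href]")
]

print(len(all_anchors), len(detail_only), second_section)
```

Indexing with [1] depends on the order of the sections in the page, so it would need adjusting if the site's layout changes.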

Getting repeated links when scraping
