Getting the href attribute correctly in BeautifulSoup

Time: 2016-06-30 14:15:59

Tags: python web-scraping beautifulsoup html-parsing python-3.4

I am trying to get the href values from anchor tags. The HTML looks like this:

<div class="srTitleFull pcLink"><a style="display:block" name="000 Plus system requirements" title="000 Plus System Requirements" href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div><div class="srDescFull"><td>000+ is a bite-sized hardcore platformer. Its mini...</td></div><div class="srDateFull">Feb-10-2015</div>

<div class="srTitleFull pcLink"><a style="display:block" name="0RBITALIS system requirements" title="0RBITALIS System Requirements" href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div><div class="srDescFull"><td>0RBITALIS is a satellite launching simulator with ...</td></div><div class="srDateFull">May-28-2015</div><div class="srGenreFull">Sim</div><br /></div><div class="srRowFull"><div class="srTitleFull pcLink"><a style="display:block" name="10 Years After system requirements" title="10 Years After System Requirements" href="../games/index.php?g_id=22220&game=10 Years After">10 Years After</a></div>

So I am trying to get links such as ../games/index.php?g_id=21580&game=000 Plus and ../games/index.php?g_id=22220&game=10 Years After. This is what I tried:

from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.Request('http://www.game-debate.com/games/index.php?year=2015',headers={'User-Agent': 'Mozilla/5.0'})
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)


url_list = []
for x in soup.find_all("div",attrs={'class':['srTitleFull']}):
   for y in soup.find_all("a", href = True):
        url_list.append(y['href'])
for x in url_list:
    print (x)

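This gets the links, but the printing goes on forever. Probably because of the two for loops, I am adding the links to the list more than once. I cannot figure out how to get each of these links once and add them to the list.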

1 answer:

Answer 0 (score: 3)

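The problem with your nested loop is that you call soup.find_all() in both the outer and the inner loop, which asks BeautifulSoup to search the entire tree each time. As a result, every link on the page is appended once for each matching div. You meant to use the x loop variable to search for the links inside that div, making a "context-specific" search, so to speak: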

url_list = []
for x in soup.find_all("div",attrs={'class':['srTitleFull']}):
   for y in x.find_all("a", href = True):  # < FIX applied here
        url_list.append(y['href'])
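
To make the difference concrete, here is a minimal sketch that runs both versions against a small inline document instead of the live page. The HTML sample, the extra /about link, and the helper names collect_unscoped and collect_scoped are made up for illustration, and "html.parser" is chosen only so the sketch has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the markup in the question.
html = """
<div class="srRowFull">
  <div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580&amp;game=000 Plus">000 Plus</a></div>
  <div class="srTitleFull pcLink"><a href="../games/index.php?g_id=23521&amp;game=0RBITALIS">0RBITALIS</a></div>
</div>
<a href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")


def collect_unscoped(soup):
    """Inner loop searches the whole tree again (the behaviour in the question)."""
    urls = []
    for x in soup.find_all("div", class_="srTitleFull"):
        for y in soup.find_all("a", href=True):
            urls.append(y["href"])
    return urls


def collect_scoped(soup):
    """Inner loop searches only inside the current div."""
    urls = []
    for x in soup.find_all("div", class_="srTitleFull"):
        for y in x.find_all("a", href=True):
            urls.append(y["href"])
    return urls


unscoped = collect_unscoped(soup)
scoped = collect_scoped(soup)
print(len(unscoped), len(scoped))
```

With two matching divs and three links in the sample, the unscoped version collects every link twice, while the scoped version collects only the two title links. On a page with hundreds of rows the unscoped list grows with rows times links, which would explain why printing seems to never finish.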

There is a better way, though.

I would use a CSS selector to locate the links:

url_list = [a['href'] for a in soup.select(".srTitleFull > a")]

where .srTitleFull > a matches all a elements located directly inside an element that has the srTitleFull class.

This way you do not need a nested loop at all.
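
As a small illustration of what "directly inside" means, the sketch below compares the child combinator (>) with the plain descendant selector on a made-up snippet. The sample HTML and its file names are hypothetical, and select() is assumed to be available as in current bs4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one direct child link, one nested deeper, one in another div.
html = """
<div class="srTitleFull pcLink"><a href="a.php">A</a></div>
<div class="srTitleFull pcLink"><span><a href="nested.php">Nested</a></span></div>
<div class="srDescFull"><a href="other.php">Other</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: only <a> elements whose parent has the srTitleFull class.
direct = [a["href"] for a in soup.select(".srTitleFull > a")]

# Descendant selector: <a> elements at any depth below such an element.
any_depth = [a["href"] for a in soup.select(".srTitleFull a")]

print(direct)
print(any_depth)
```

For the markup in the question, where each link is an immediate child of its srTitleFull div, both selectors should return the same links; the child combinator is simply the stricter of the two.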