I'm trying to get the href values from anchor tags. The HTML looks like this:
<div class="srTitleFull pcLink"><a style="display:block" name="000 Plus system requirements" title="000 Plus System Requirements" href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div><div class="srDescFull"><td>000+ is a bite-sized hardcore platformer. Its mini...</td></div><div class="srDateFull">Feb-10-2015</div>
<div class="srTitleFull pcLink"><a style="display:block" name="0RBITALIS system requirements" title="0RBITALIS System Requirements" href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div><div class="srDescFull"><td>0RBITALIS is a satellite launching simulator with ...</td></div><div class="srDateFull">May-28-2015</div><div class="srGenreFull">Sim</div><br /></div><div class="srRowFull"><div class="srTitleFull pcLink"><a style="display:block" name="10 Years After system requirements" title="10 Years After System Requirements" href="../games/index.php?g_id=22220&game=10 Years After">10 Years After</a></div>
So I'm trying to get links such as ../games/index.php?g_id=21580&game=000 Plus
and ../games/index.php?g_id=22220&game=10 Years After.
This is what I tried:
from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.Request('http://www.game-debate.com/games/index.php?year=2015', headers={'User-Agent': 'Mozilla/5.0'})
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)
url_list = []
for x in soup.find_all("div", attrs={'class': ['srTitleFull']}):
    for y in soup.find_all("a", href=True):
        url_list.append(y['href'])
for x in url_list:
    print(x)
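This does get the links, but the printing part goes on forever. Probably because of the two for loops, I'm adding each link to the list more than once. I can't figure out how to get these links once each and add them to the list.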
Answer 0 (score: 3)
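The problem with your nested loops is that you call soup.find_all() in both the outer and the inner loop, which asks BeautifulSoup to search the whole tree every time. As a result, every link on the page is appended once per srTitleFull div. You meant to use the x loop variable to search for links inside that div, making a "context-specific" search, so to speak: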
url_list = []
for x in soup.find_all("div", attrs={'class': ['srTitleFull']}):
    for y in x.find_all("a", href=True):  # < FIX applied here
        url_list.append(y['href'])
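To make the difference concrete, here is a minimal sketch that runs both versions of the inner loop on a small inline document. The HTML string is a hypothetical sample trimmed from the markup in the question (the /other link is made up for illustration), and "html.parser" is chosen only so the example does not depend on an external parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, trimmed from the markup in the question.
HTML = """
<div class="srRowFull">
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div>
<div class="srDescFull"><a href="/other">not a title link</a></div>
</div>
<div class="srRowFull">
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=23521&game=0RBITALIS">0RBITALIS</a></div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
title_divs = soup.find_all("div", attrs={"class": "srTitleFull"})

# Inner search on the whole tree: every link on the page is
# collected again for each title div.
whole_tree = [a["href"] for x in title_divs for a in soup.find_all("a", href=True)]

# Inner search scoped to x: only the links inside that div are collected.
scoped = [a["href"] for x in title_divs for a in x.find_all("a", href=True)]

print(len(whole_tree), len(scoped))
```

With 2 title divs and 3 links in the sample, the unscoped version collects 6 entries and the scoped one collects 2. On the real page, with many rows, that product is what makes the list so large.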
There is a better way, though.
I would use a CSS selector to locate the links:
url_list = [a['href'] for a in soup.select(".srTitleFull > a")]
where .srTitleFull > a matches every a element that is a direct child of an element with the srTitleFull class.
This way you don't need a nested loop at all.
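As a rough illustration of how the selector behaves, the sketch below runs it on a hypothetical inline sample. The /elsewhere and /nested links are made up to show what the child combinator (>) excludes, and the second selector (.srTitleFull a[href]) is only there for comparison; it is not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample based on the markup in the question.
HTML = """
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=21580&game=000 Plus">000 Plus</a></div>
<div class="srDescFull"><a href="/elsewhere">outside a title div</a></div>
<div class="srTitleFull pcLink"><span><a href="/nested">nested, not a direct child</a></span></div>
<div class="srTitleFull pcLink"><a href="../games/index.php?g_id=22220&game=10 Years After">10 Years After</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Child combinator: only <a> tags that are direct children of .srTitleFull.
direct = [a["href"] for a in soup.select(".srTitleFull > a")]

# Descendant combinator, for comparison: any <a href> at any depth inside .srTitleFull.
descendants = [a["href"] for a in soup.select(".srTitleFull a[href]")]

for href in direct:
    print(href)
```

For markup shaped like the question's, where each link sits directly inside its srTitleFull div, both selectors should return the same links; the direct-child form is simply the stricter of the two.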