Question

import urllib2
from bs4 import BeautifulSoup

next_page = 'https://research.stlouisfed.org/fred2/tags/series?et=&pageID=1&t='
opened_url = urllib2.urlopen(next_page).read()

soup = BeautifulSoup(opened_url)

hrefs = soup.find_all("div",{"class":"col-xs-12 col-sm-10"})

hrefs now looks like this:

[<div class="col-xs-12 col-sm-10">\n<a class="series-title" href="/fred2/series/GDPC1" style="font-size:1.2em">Real Gross Domestic Product</a>\n</div>, <div class="col-xs-12 col-sm-10">\n<a class="series-title" href="/fred2/series/CPIAUCSL" style="font-size:1.2em">Consumer Price Index for All Urban Consumers: All Items</a>\n</div>, ...

I tried to get the href with something like hrefs[1]['href'], but I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 958, in __getitem__
    return self.attrs[key]
KeyError: 'href'

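I just want to extract all 18 links from this page. I suppose I could convert each element of hrefs to a string and then search that string for the href, but that approach defeats the purpose of bs4.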

Answer 1

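Each element of hrefs is a div tag, and a div has no href attribute, which is why you get the KeyError. You need to extract the href from the a tag nested inside it: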

hrefs = soup.find_all("div",{"class":"col-xs-12 col-sm-10"})
print hrefs[1].find('a')['href']

To get the href of every a tag inside the div tags, you can use:

for tag in hrefs:
    print tag.find('a', href=True)['href']
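
For anyone on Python 3, the same idea can be sketched as below. This is only an illustration, not part of the original answer: it parses the sample markup shown in the question instead of fetching the page, and the BASE_URL constant used to turn the relative links into absolute ones is an assumption based on the host in the question's URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup copied from the output shown in the question (two of the 18 entries).
html = """
<div class="col-xs-12 col-sm-10">
<a class="series-title" href="/fred2/series/GDPC1" style="font-size:1.2em">Real Gross Domestic Product</a>
</div>
<div class="col-xs-12 col-sm-10">
<a class="series-title" href="/fred2/series/CPIAUCSL" style="font-size:1.2em">Consumer Price Index for All Urban Consumers: All Items</a>
</div>
"""

# Assumption: the relative hrefs resolve against the host from the question's URL.
BASE_URL = "https://research.stlouisfed.org"

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", {"class": "col-xs-12 col-sm-10"})

# The div itself carries no href; .get() returns None where ['href'] raises KeyError.
assert divs[0].get("href") is None

links = []
for div in divs:
    a = div.find("a", href=True)  # first nested <a> that has an href attribute
    if a is not None:             # skip divs without a link instead of crashing
        links.append(urljoin(BASE_URL, a["href"]))

print(links)
```

The None check matters because find returns None when a div contains no matching a tag, and indexing None with ['href'] would raise a TypeError.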
