How to get the href from an <a> inside an <h2 class="section-heading">: BeautifulSoup

Date: 2016-02-12 06:13:17

Tags: python beautifulsoup python-requests bs4

I am trying to extract a link which is written like this:

<h2 class="section-heading">
    <a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>

My code is:

from bs4 import BeautifulSoup
import requests, re

def get_data():
    url='http://www.nytimes.com/'
    s_code=requests.get(url)
    plain_text = s_code.text
    soup = BeautifulSoup(plain_text)
    head_links=soup.findAll('h2', {'class':'section-heading'})

    for n in head_links :
       a = n.find('a')
       print a
       print n.get['href'] 
       #print a['href']
       #print n.get('href')
       #headings=n.text
       #links = n.get('href')
       #print headings, links

get_data()  

The line "print a" simply prints out the whole <a> element inside the <h2 class="section-heading">, i.e.

<a href="http://www.nytimes.com/pages/world/index.html">World »</a>

But when I execute "print n.get['href']", it throws an error:

print n.get['href'] 
TypeError: 'instancemethod' object has no attribute '__getitem__'

Am I doing something wrong here? Please help

I couldn't find a similar question here. My case is a bit unusual: I am trying to extract a link that sits inside an element with the specific class name section-heading.

1 answer:

Answer 0 (score: 2)

First of all, you want the href of the a element, so on that line you should be accessing a, not n. Secondly, it should be either

a.get('href')

or

a['href']

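The latter form raises a KeyError if no such attribute is found, whereas the former returns None, just like the usual dictionary/mapping interface. Since .get is a method, it has to be called (.get(...)); indexing/element access (.get[...]) does not work on it, and that is what the TypeError in this question is about.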
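To make the difference concrete, here is a minimal sketch in Python 3 that runs against an inline HTML snippet rather than the live site. The snippet, the second `<a>` without an href, and the variable names are made up for illustration; only `.get(...)` and `[...]` come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link with an href, one without.
html = (
    '<h2 class="section-heading">'
    '<a href="http://www.nytimes.com/pages/arts/index.html">Arts</a>'
    '<a>No link</a>'
    '</h2>'
)
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

# Both forms return the attribute value when it exists.
href_via_get = with_href.get("href")
href_via_index = with_href["href"]

# .get() returns None for a missing attribute ...
missing_via_get = without_href.get("href")

# ... while [] raises KeyError, like a dict.
try:
    without_href["href"]
    missing_raises = False
except KeyError:
    missing_raises = True

# Subscripting the method itself (the bug in the question) is a TypeError.
try:
    with_href.get["href"]
    subscript_fails = False
except TypeError:
    subscript_fails = True
```

Under Python 3 the wording of the TypeError differs from the Python 2 message quoted in the question, but the cause is the same: the method is being subscripted instead of called.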

Note that find can also fail there and return None; perhaps you want to iterate over n.find_all('a', href=True) instead:

for n in head_links:
   for a in n.find_all('a', href=True):
       print(a['href'])

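Simpler than using find_all is to use the select method with a CSS selector. Here a single operation returns only those <a> elements that have an href attribute and sit inside an <h2 class="section-heading">, as easily as with jQuery: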

soup = BeautifulSoup(plain_text, 'html.parser')  # naming the parser avoids bs4's "no parser specified" warning
for a in soup.select('h2.section-heading a[href]'):
    print(a['href'])
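As a rough check that the selector and the nested find_all loop pick out the same links, the following sketch compares the two on a small inline document. The HTML, including the `h2 class="other"` decoy and the example.com URL, is invented for illustration and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one matching link, one <a> without href,
# and one link under a heading with a different class.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/world/index.html">World</a></h2>
<h2 class="section-heading"><a>no href here</a></h2>
<h2 class="other"><a href="http://example.com/ignored">Ignored</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selector: <a> elements with an href, descended from h2.section-heading.
via_select = [a["href"] for a in soup.select("h2.section-heading a[href]")]

# The same filter written as nested find_all calls.
via_find_all = [
    a["href"]
    for h2 in soup.find_all("h2", class_="section-heading")
    for a in h2.find_all("a", href=True)
]
```

Both lists should contain only the World link, since the `[href]` part of the selector plays the same role as `href=True` in find_all.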

(Also, please use the lower-case method names in any new code that you write, e.g. find_all rather than findAll.)