I am trying to extract a link which is written like this:
<h2 class="section-heading">
<a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>
My code is:
from bs4 import BeautifulSoup
import requests, re

def get_data():
    url='http://www.nytimes.com/'
    s_code=requests.get(url)
    plain_text = s_code.text
    soup = BeautifulSoup(plain_text)
    head_links=soup.findAll('h2', {'class':'section-heading'})
    for n in head_links :
        a = n.find('a')
        print a
        print n.get['href']
        #print a['href']
        #print n.get('href')
        #headings=n.text
        #links = n.get('href')
        #print headings, links

get_data()
The line "print a" simply prints the whole <a> element inside the <h2 class="section-heading">, i.e.
<a href="http://www.nytimes.com/pages/world/index.html">World »</a>
but when I execute "print n.get['href']", it throws an error:
print n.get['href']
TypeError: 'instancemethod' object has no attribute '__getitem__'
Am I doing something wrong here? Please help.
I couldn't find a similar question here. My case is a bit unusual: I am trying to extract a link that sits inside an element with the specific class name section-heading.
Answer 0 (score: 2)
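First, you want the href of the a element, so on that line you should access a rather than n. Second, it should be either
a.get('href')
or
a['href']
The latter form raises an exception if no such attribute is found, whereas the former returns None, just like the usual dictionary/mapping interface. Since .get is a method, it has to be called (.get(...)); indexing/item access does not work on it (.get[...]), and that is the crux of this question.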
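The difference between the three spellings can be checked offline. The following is a minimal sketch, assuming Python 3 and an inline HTML snippet modelled on the one in the question (the second heading, whose <a> has no href, is made up for illustration):

```python
from bs4 import BeautifulSoup

# Inline sample markup; the second <a> deliberately has no href (illustrative only).
html = """
<h2 class="section-heading">
<a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>
<h2 class="section-heading"><a>No link here</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

# Calling the method and indexing the tag both return the attribute value.
href_via_get = with_href.get("href")
href_via_index = with_href["href"]

# .get() returns None for a missing attribute ...
missing_via_get = without_href.get("href")

# ... while item access raises KeyError.
try:
    without_href["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Indexing the method itself, as in the question, raises TypeError.
try:
    with_href.get["href"]
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(href_via_get, missing_via_get, raised_key_error, raised_type_error)
```

Under Python 3 the TypeError message differs from the Python 2 one quoted in the question, but the cause is the same: a method object is being subscripted instead of called.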
Note that find may also fail there and return None, so perhaps you want to iterate over n.find_all('a', href=True) instead:
for n in head_links:
    for a in n.find_all('a', href=True):
        print(a['href'])
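Even simpler than using find_all is to use the select method with a CSS selector. Here, with a single operation, we get only those <a> elements that have an href attribute and sit inside an <h2 class="section-heading">, as easily as with JQuery.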
soup = BeautifulSoup(plain_text)
for a in soup.select('h2.section-heading a[href]'):
    print(a['href'])
(Also, please use the lower-case method names, such as find_all rather than findAll, in any new code that you write.)
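To try the CSS selector without a network request, here is a small self-contained sketch, assuming Python 3 and made-up sample markup (the heading without an href and the <h3> link are illustrative only). It also collects the heading text alongside each link, which is what the commented-out lines in the question seemed to be aiming for:

```python
from bs4 import BeautifulSoup

# Sample markup (illustrative): two matching headings, one <a> without href,
# and one link outside any section-heading.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/world/index.html">World »</a></h2>
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a></h2>
<h2 class="section-heading"><a>No href</a></h2>
<h3><a href="http://example.com/other">Other</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <a> elements that have an href and live under h2.section-heading match.
links = [
    (a.get_text(strip=True), a["href"])
    for a in soup.select("h2.section-heading a[href]")
]

for text, href in links:
    print(text, href)
```

Because the selector already requires the href attribute, a["href"] cannot raise KeyError inside the loop.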