Extracting the title of a link using BeautifulSoup

Asked: 2015-09-12 18:55:49

Tags: python python-2.7 web-scraping beautifulsoup python-requests

I am trying to extract the title of a link using BeautifulSoup. The code I am working with is as follows:

import requests
from bs4 import BeautifulSoup

url = "http://www.example.com"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for link in soup.findAll('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'}):
    title = link.get('title')
    print title

A sample link element looks like this:

<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>

However, running the code above prints nothing. How can I extract the value stored in the title attribute of the anchor tag?

2 answers:

Answer 0 (score: 3)

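Well, it looks like you put two spaces between s-access-detail-page and a-text-normal, and that is why no matching links are found. Try again with the correct number of spaces, then print the number of links found. You can also print the tag itself with print link.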

import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.in/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=python"
source_code = requests.get(url)
plain_text = source_code.content
soup = BeautifulSoup(plain_text, "lxml")
links = soup.findAll('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})
print len(links)
for link in links:
    title = link.get('title')
    print title
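
The effect of the extra space can be reproduced without hitting the live page. The following is a minimal sketch, assuming Beautiful Soup 4 is installed. It parses a shortened copy of the sample element from the question with the built-in html.parser (chosen here only to avoid depending on lxml) and compares the two class strings. The names html, two_spaces, one_space and titles are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample element from the question (illustrative only).
html = (
    '<a class="a-link-normal s-access-detail-page a-text-normal" '
    'href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" '
    'title="Introduction To Computation And Programming Using Python">'
    '<h2>Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
)
soup = BeautifulSoup(html, "html.parser")

# Two spaces before a-text-normal, as in the question.
two_spaces = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page  a-text-normal'})
# Single spaces throughout, as in the markup.
one_space = soup.find_all('a', {'class': 'a-link-normal s-access-detail-page a-text-normal'})

print(len(two_spaces))
print(len(one_space))

# Collect the title attribute of every matched link.
titles = [link.get('title') for link in one_space]
print(titles)
```

Printing both counts side by side makes it easy to see which class string the search actually accepts before looking at the titles.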

Answer 1 (score: 1)

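You are searching for an exact string containing multiple classes here. In that case the class string must match exactly, with single spaces between the class names.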

See the "Searching by CSS class" section of the documentation:

You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
     

But searching for variants of the string value won't work:

css_soup.find_all("p", class_="strikeout body")
# []
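
The two quoted snippets rely on a css_soup object that the excerpt does not define. A minimal sketch that makes them runnable, assuming a one-element document modeled on the output shown in the quote and the built-in html.parser (both are assumptions, not part of the quoted text):

```python
from bs4 import BeautifulSoup

# Assumed document: a single <p> carrying the two classes from the quoted output.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

# Same class order as in the markup.
exact = css_soup.find_all("p", class_="body strikeout")
# Same classes, different order.
swapped = css_soup.find_all("p", class_="strikeout body")
# One class on its own.
single = css_soup.find_all("p", class_="body")

print(len(exact))
print(len(swapped))
print(len(single))
```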

You would be better off searching for a single class:

soup.find_all('a', class_='a-link-normal')

If you have to match multiple classes, use a CSS selector:

soup.select('a.a-link-normal.s-access-detail-page.a-text-normal')

With a selector, the order in which the classes are listed does not matter.

Demo:

>>> from bs4 import BeautifulSoup
>>> plain_text = u'<a class="a-link-normal s-access-detail-page a-text-normal" href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" title="Introduction To Computation And Programming Using Python"><h2 class="a-size-medium a-color-null s-inline s-access-title a-text-normal">Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
>>> soup = BeautifulSoup(plain_text)
>>> for link in soup.find_all('a', class_='a-link-normal'):
...     print link.text
... 
Introduction To Computation And Programming Using Python
>>> for link in soup.select('a.a-link-normal.s-access-detail-page.a-text-normal'):
...     print link.text
... 
Introduction To Computation And Programming Using Python
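
The demo prints the link text, whereas the question asked for the title attribute. Here is a sketch along the same lines, assuming a shortened copy of the sample element and the built-in html.parser. It reads the attribute with get('title') and lists the classes in the selector in a different order from the markup. The names html, links and titles are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample element from the question (illustrative only).
html = (
    '<a class="a-link-normal s-access-detail-page a-text-normal" '
    'href="http://www.amazon.in/Introduction-Computation-Programming-Using-Python/dp/8120348664" '
    'title="Introduction To Computation And Programming Using Python">'
    '<h2>Introduction To Computation And Programming Using <strong>Python</strong></h2></a>'
)
soup = BeautifulSoup(html, "html.parser")

# The classes appear here in a different order than in the markup.
links = soup.select('a.a-text-normal.a-link-normal.s-access-detail-page')

# get() returns None instead of raising KeyError when the attribute is missing.
titles = [link.get('title') for link in links]
print(titles)
```

Using get('title') rather than link['title'] keeps the loop from failing on any matched anchor that happens to lack a title attribute.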