Beautiful Soup: getting the name of an <a> element within a div

Time: 2015-03-04 18:36:38

Tags: beautifulsoup html-parsing

I am using Beautiful Soup for the first time, and I am trying to get the value of a specific element in a web page.

For example, in this snippet:

<div class="otg-vendor-name"><a class="otg-vendor-name-link"     href="http://www.3brotherskitchen.com" target="_blank">3 Brothers Kitchen</a></div>

I want to get "3 Brothers Kitchen" out of the <a> tag.

So far I have tried the following, which does not seem to work:

import urllib2
from bs4 import BeautifulSoup

url = "http://someurl"

def get_all_vendors():
    try:
        web_page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(web_page)
        c = []
        c.append(soup.findAll("div", {"class": 'otg-vendor-name'}).contents)
        print c

    except urllib2.HTTPError:
        print("HTTPERROR!")

    except urllib2.URLError:
        print("URLERROR!")

    return c

1 answer:

Answer 0 (score: 0)

You can get it with a CSS selector:

soup.select('div.otg-vendor-name > a.otg-vendor-name-link')[0].text

Or, with find():

soup.find('div', class_='otg-vendor-name').find('a', class_='otg-vendor-name-link').text
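
As a quick check, both lookups can be run against the snippet from the question. This is a minimal sketch written for Python 3; the explicit "html.parser" argument and the variable names are additions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The HTML snippet from the question, inlined so no network access is needed.
html = (
    '<div class="otg-vendor-name">'
    '<a class="otg-vendor-name-link" href="http://www.3brotherskitchen.com" '
    'target="_blank">3 Brothers Kitchen</a></div>'
)

# Assumption: the built-in "html.parser" backend is good enough for this markup.
soup = BeautifulSoup(html, "html.parser")

# CSS selector: an <a> that is a direct child of the div.
name_via_select = soup.select("div.otg-vendor-name > a.otg-vendor-name-link")[0].text

# Chained find(): locate the div first, then the link inside it.
name_via_find = (
    soup.find("div", class_="otg-vendor-name")
    .find("a", class_="otg-vendor-name-link")
    .text
)

print(name_via_select)
print(name_via_find)
```

Note that both forms fail loudly when nothing matches: indexing an empty select() result raises IndexError, and calling .find() on the None returned by a failed find() raises AttributeError.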

Update (using requests and providing a User-Agent header):

from bs4 import BeautifulSoup
import requests

url = 'http://offthegridsf.com/vendors#food'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}

    session.get(url)

    response = session.get(url)
    soup = BeautifulSoup(response.content)

    print soup.select('div.otg-vendor-name > a.otg-vendor-name-link')[0].text
    print soup.find('div', class_='otg-vendor-name').find('a', class_='otg-vendor-name-link').text
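
The attempt in the question most likely fails because findAll() returns a list-like result set rather than a single tag, so it has no .contents of its own; each match has to be handled individually. One possible Python 3 rework of get_all_vendors() is sketched below. The helper name extract_vendor_names, the shortened User-Agent string, and the use of urllib.request (the Python 3 counterpart of urllib2) are assumptions for illustration, and it is not verified that the live page still uses these class names.

```python
import urllib.error
import urllib.request

from bs4 import BeautifulSoup


def extract_vendor_names(html):
    """Return the text of every vendor link found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("div.otg-vendor-name > a.otg-vendor-name-link")
    # Each match is a Tag; take its text one element at a time.
    return [link.get_text(strip=True) for link in links]


def get_all_vendors(url):
    """Fetch the page and return its vendor names, or [] if the request fails."""
    # Assumption: a browser-like User-Agent is needed, as in the update above.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(request) as response:
            return extract_vendor_names(response.read())
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        print("HTTP error:", exc.code)
    except urllib.error.URLError as exc:
        print("URL error:", exc.reason)
    return []
```

Keeping the parsing in its own function means it can be tried on a saved HTML string without touching the network, for example extract_vendor_names(open("vendors.html").read()), where vendors.html is a hypothetical local copy of the page.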