When viewing the page in "Inspect Element" mode, I'm trying to pull out the part classified as (text):
<div class="sammy">
<div class="sammyListing">
<a href="/Chicago_Magazine/blahblahblah">
<b>BLT</b>
<br>
"Old Oak Tap"   <--- THIS IS THE TEXT I WANT
<br>
<em>Read more</em>
</a>
</div>
</div>
Here is my code so far; the last line is the list comprehension in question:
STEM_URL = 'http://www.chicagomag.com'
BASE_URL = 'http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'
soup = BeautifulSoup(urlopen(BASE_URL).read())
sammies = soup.find_all("div", "sammy")
sammy_urls = []
for div in sammies:
    if div.a["href"].startswith("http"):
        sammy_urls.append(div.a["href"])
    else:
        sammy_urls.append(STEM_URL + div.a["href"])
restaurant_names = [x for x in div.a.content]
I've tried div.a.br.content and div.br, but can't seem to get it right.
If a RegEx approach is suggested, I'd also appreciate a non-RegEx way.
Answer 0 (score: 1)
Use a CSS selector to locate the b element of each listing, then find the next text sibling:
for b in soup.select("div.sammy > div.sammyListing > a > b"):
print b.find_next_sibling(text=True).strip()
Demo:
In [1]: from urllib2 import urlopen
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(urlopen('http://www.chicagomag.com/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/'))
In [4]: for b in soup.select("div.sammy > div.sammyListing > a > b"):
...: print b.find_next_sibling(text=True).strip()
...:
Old Oak Tap
Au Cheval
...
The Goddess and Grocer
Zenwich
Toni Patisserie
Phoebe’s Bakery
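The same technique can be sketched offline against the HTML fragment from the question, so it runs without a network call. This is a minimal sketch assuming bs4 is installed and uses Python 3; string=True is the current spelling of the older text= argument used in the answer above:

```python
from bs4 import BeautifulSoup

# The fragment from the question, with the target text as a bare
# text node between the two <br> tags inside the <a>.
html = """
<div class="sammy">
  <div class="sammyListing">
    <a href="/Chicago_Magazine/blahblahblah">
      <b>BLT</b><br>
      Old Oak Tap<br>
      <em>Read more</em>
    </a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for b in soup.select("div.sammy > div.sammyListing > a > b"):
    # The restaurant name is the first string sibling after the <b> tag;
    # strip() removes the surrounding whitespace from the markup.
    print(b.find_next_sibling(string=True).strip())  # prints: Old Oak Tap
```

The key point is that "Old Oak Tap" is not inside any tag of its own, so it cannot be selected directly; navigating from the b element to its next text sibling is what isolates it.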