I have a simple piece of Python code:
from bs4 import BeautifulSoup
import urllib2
webpage = urllib2.urlopen('http://fakepage.html')
soup = BeautifulSoup(webpage)
for anchor in soup.find_all("div", id="description"):
    print anchor
I almost get what I want, but between <div id=description> and </div> I get a lot of tags:
<div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
I would like to get only the text (not the tags) between <div id=description> and </div>, so that I can count the words.
Is there any function in BeautifulSoup that can help me?
Answer 0 (score: 2)
Use the element.get_text() method to get just the text:
for anchor in soup.find_all("div", id="description"):
    print anchor.get_text()
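You can pass strip=True to strip the surrounding whitespace from each piece of text; the first argument is the separator used to join the stripped strings:
for anchor in soup.find_all("div", id="description"):
    print anchor.get_text(' ', strip=True)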
Demo:
>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
... '''
>>> soup = BeautifulSoup(sample)
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text()
...
some text to show  lots of  useless tags 
>>> for anchor in soup.find_all("div", id="description"):
...     print anchor.get_text(' ', strip=True)
...
some text to show lots of useless tags
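Since the stated goal is to count words, here is a minimal sketch of that last step. It assumes a "word" is any whitespace-separated token; the helper name count_words and the explicit "html.parser" argument are my own choices for illustration, not something from the question.

```python
from bs4 import BeautifulSoup

sample = (
    '<div id="description"><div class="t"><p>some text to show <br><br> '
    'lots of <b> useless</b> tags </br></br></p></div></div>'
)

# Naming the parser explicitly avoids depending on whichever one is installed.
soup = BeautifulSoup(sample, "html.parser")


def count_words(tag):
    """Count whitespace-separated tokens in the text of a tag (hypothetical helper)."""
    text = tag.get_text(" ", strip=True)
    # str.split() with no arguments splits on any run of whitespace.
    return len(text.split())


# One count per matching <div id="description">.
counts = [count_words(div) for div in soup.find_all("div", id="description")]
print(counts)  # [8]
```

One caveat: with ' ' as the separator, a word that is interrupted by an inline tag (for example use<b>less</b>) would be counted as two tokens, so the right separator depends on how your pages are marked up.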