Question

I have some simple Python code:

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

webpage = urlopen('http://fakepage.html')
soup = BeautifulSoup(webpage, 'html.parser')  # name the parser explicitly

for anchor in soup.find_all("div", id="description"):
    print(anchor)

This gets me almost what I want, but the output still contains a lot of tags between <div id=description> and </div>:

<div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>

I want only the text (not the tags) between <div id=description> and </div> so that I can count the words. Is there a function in BeautifulSoup that can help me?

Answer 1

Use the element.get_text() method to get only the text:

for anchor in soup.find_all("div", id="description"):
    print(anchor.get_text())

You can pass strip=True to remove extra whitespace; the first argument is the separator used to join the stripped strings:

for anchor in soup.find_all("div", id="description"):
    print(anchor.get_text(' ', strip=True))
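
To make the role of the separator concrete, here is a minimal Python 3 sketch. It assumes the built-in "html.parser" backend, and the sample markup is made up for illustration. It compares get_text(' ', strip=True) with joining the element's stripped_strings by hand, which should give the same result:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original page.
html = '<div id="description"><p>some text <b> to show</b> </p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="description")

# strip=True strips each text fragment and drops the empty ones;
# the first argument is the string placed between the fragments.
via_get_text = div.get_text(" ", strip=True)

# stripped_strings yields the same stripped fragments one by one.
via_join = " ".join(div.stripped_strings)

print(via_get_text)
print(via_get_text == via_join)
```

Note that a separator of ' ' also splits a word whose letters are spread across tags (for example use<b>less</b>), so choose the separator to suit the markup.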

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div id="description"><div class="t"><p>some text to show <br><br> lots of <b> useless</b> tags </br></br></p></div></div>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> for anchor in soup.find_all("div", id="description"):
...     print(anchor.get_text())
... 
some text to show  lots of  useless tags 
>>> for anchor in soup.find_all("div", id="description"):
...     print(anchor.get_text(' ', strip=True))
... 
some text to show lots of useless tags
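
Since the goal in the question is to count words, the extracted text can then be split on whitespace. The following is a rough Python 3 sketch rather than a definitive implementation: count_words is a hypothetical helper name, "html.parser" is an assumed parser choice, and a "word" is simply taken to be any whitespace-separated token.

```python
from bs4 import BeautifulSoup

sample = (
    '<div id="description"><div class="t"><p>some text to show <br><br> '
    'lots of <b> useless</b> tags </br></br></p></div></div>'
)


def count_words(html):
    """Count whitespace-separated tokens inside every div with id="description".

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    total = 0
    for div in soup.find_all("div", id="description"):
        # Extract the text without tags, then split on runs of whitespace.
        text = div.get_text(" ", strip=True)
        total += len(text.split())
    return total


print(count_words(sample))
```

Punctuation stays attached to the neighbouring token under this definition; a regular expression could be swapped in if a stricter notion of a word is needed.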
