Question

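I'm unclear on how to use BeautifulSoup's ResultSet object, i.e. bs4.element.ResultSet.

After using find_all(), how do I extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like this: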

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

First create the soup and find all the links:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

Output:

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also print each link's href:

for link in soup.find_all('a'):
    print(link.get('href'))

Output:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I want only the text from the tags with class_="sister", i.e.

Elsie
Lacie
Tillie

One could try

for link in soup.find_all('a'):
    print(link.get_text())

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'

Answer 1

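Filter find_all() with class_='sister'.

Note: mind the underscore after class. It is a special case because class is a reserved word in Python.

"It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_."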

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
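
As a rough illustration of the quoted passage, here is a minimal sketch that filters by CSS class in two ways: the class_ keyword and the attrs dictionary. The shortened html_doc and the second link with class "other" are made up for this example and are not part of the question's document.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, shortened from the question's html_doc;
# the "other" link is invented to show that the filter excludes it.
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="other" href="http://example.com/bob" id="link2">Bob</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# class_ (with the trailing underscore) avoids the reserved word "class".
by_keyword = soup.find_all('a', class_='sister')

# The attrs dictionary is an alternative that needs no underscore.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# Both calls match the same single tag.
names = [tag.get_text(strip=True) for tag in by_keyword]
print(names)
```

Both forms should select the same tags; class_ is simply the shorter spelling.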

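Once you have all the tags with class sister, you can call .text on each of them to get the text. Be sure to strip the text, since each tag's text includes the surrounding newlines and spaces.

For example: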

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    print(tag.text.strip())

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
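
For what it's worth, the AttributeError in the question is what you get when get_text() is called on the ResultSet itself rather than on each tag inside it. A ResultSet behaves like a list of Tag objects, so the text has to be pulled out element by element. The sketch below assumes a two-link version of the question's HTML and uses get_text(strip=True) as an alternative to .text.strip(); the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's html_doc.
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
result = soup.find_all('a', class_='sister')

# A ResultSet is a list subclass, so it has no get_text() of its own.
try:
    result.get_text()
    raised = False
except AttributeError:
    raised = True

# Calling get_text() on each Tag works; strip=True trims the whitespace.
names = [tag.get_text(strip=True) for tag in result]
print(raised, names)
```

If only the first match is needed, find() returns a single Tag, and get_text() can be called on it directly.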
