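I'm not clear on how to work with BeautifulSoup's ResultSet object, i.e. bs4.element.ResultSet. After using find_all(), how do I extract the text?
Example:
In the bs4 documentation, the HTML document html_doc looks like this: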
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
First, create the soup and find all the links:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')
Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
We can also print each link's href:
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
I want only the text of the tags with class_="sister", i.e.
Elsie
Lacie
Tillie
One could try:
for link in soup.find_all('a'):
    print(link.get_text())
but this results in an error:
AttributeError: 'ResultSet' object has no attribute 'get_text'
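Answer 0 (score: 5)
Filter find_all() with class_='sister'.
Note: mind the underscore after class. It is a special case because class is a reserved word in Python.
"It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:"
Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class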
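As a rough sketch of the class_ filter in use, the snippet below compares it with two other ways of matching on the CSS class. The HTML is a shortened stand-in for html_doc, the "other" link is added purely for illustration, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc; the non-sister link is illustrative only.
html = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="other" href="http://example.com/about" id="link4">About</a>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

# Keyword argument with the trailing underscore (plain "class" is a syntax error).
by_keyword = soup.find_all('a', class_='sister')

# The same filter passed through the attrs dictionary.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# The same filter written as a CSS selector.
by_selector = soup.select('a.sister')

for tag in by_keyword:
    print(tag['id'], tag.get_text())
```

All three calls should return the two sister links and skip the third one; which form to use is mostly a matter of taste.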
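Once you have all the sister tags, call .text on each of them to get the text. Be sure to strip the text, since it includes the surrounding newlines.
For example: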
from bs4 import BeautifulSoup
html_doc = '''<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    print(tag.text.strip())
Output:
(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
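An AttributeError like the one in the question is raised when get_text() is called on the ResultSet itself rather than on each tag inside it. Here is a minimal sketch of the difference, assuming a shortened stand-in for html_doc; the names raised and names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc.
html = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
</p>'''

soup = BeautifulSoup(html, 'html.parser')
sistertags = soup.find_all(class_='sister')

# A ResultSet behaves like a list of tags, so it has no get_text() of its own.
try:
    sistertags.get_text()
    raised = False
except AttributeError:
    raised = True

# Call get_text() on each tag instead; strip=True trims the surrounding whitespace.
names = [tag.get_text(strip=True) for tag in sistertags]
print(names)
```

Passing strip=True to get_text() is an alternative to calling .text.strip() on each tag, and the list comprehension keeps the results around for later use instead of only printing them.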