Question

I have nearly got the hang of BeautifulSoup4 in Python, but I can't seem to extract the text that sits next to the <br/> tags in my HTML data.

Data structure:

<HTML and CSS Stuff here>
<div class="menu">
<span class="author">Bob</span> 
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important to extract along with the matching author and time of submit<br/>
</div>

What I am looking for is:

Author: Bob
Time: (06 Jul at 09:21)
Data: This message is very important to extract along with the matching author and time of submit

The HTML comes in via requests and that part works fine, but I just can't get the soup calls right.

Current code:

from bs4 import BeautifulSoup
import requests
html_doc = """
<HTML and CSS Stuff here>
<div class="menu">
<span class="author">Bob</span> 
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important to extract along with the matching author and time of submit<br/>
</div>
"""

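# In the real script the HTML comes from a requests response `r` (request not shown):
# html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')

x = soup.select('div[class="menu"]')
for i in x:
    # Search within the current div rather than the whole document
    s = i.select('span[class="author"]')
    rr = i.select('span[class="smaller"]')
    for b in s:
        print(b)
        print(rr)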

Answer 1

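A <br/> tag is always an empty tag. There is no text inside it.

What you have is text between two <br/> tags, which may be what is confusing. You could remove either tag and it would still be valid HTML.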
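To see this concretely, here is a minimal sketch that lists what the div actually contains. It assumes Python 3 and the built-in html.parser, and uses the markup from the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html_doc = """
<div class="menu">
<span class="author">Bob</span>
<span class="smaller">(06 Jul at 09:21)</span>
<br/>This message is very important to extract along with the matching author and time of submit<br/>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
menu = soup.select_one("div.menu")

# Each <br/> is an empty element, so asking it for text yields an empty string.
br_texts = [br.get_text() for br in menu.find_all("br")]

# The message is a text node that is a direct child of the div,
# sitting between the two <br/> tags. Whitespace-only nodes are skipped.
text_children = [
    child.strip()
    for child in menu.children
    if isinstance(child, NavigableString) and child.strip()
]

print(br_texts)
print(text_children)
```

The only non-blank text child of the div is the message itself, which is why it has to be reached through the div or through a sibling relationship rather than through the br tag.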

You can use the .next_sibling attribute to get the text that follows a tag:
soup.select('div.menu br')[0].next_sibling
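Note that .next_sibling can be None (nothing follows the tag) or another tag rather than text. A small guard, sketched below under the assumption of Python 3 and html.parser, may help on less regular pages; text_after is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def text_after(tag):
    """Return the stripped text node directly after `tag`, or None.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    sibling = tag.next_sibling if tag is not None else None
    if isinstance(sibling, NavigableString):
        # An empty or whitespace-only node is treated as "no text".
        return sibling.strip() or None
    return None


soup = BeautifulSoup(
    '<div class="menu"><br/>Hello there<br/></div>', "html.parser"
)
first_br, second_br = soup.select("div.menu br")

print(text_after(first_br))   # a text node follows the first <br/>
print(text_after(second_br))  # nothing follows the second <br/>
```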

Demo:

>>> from bs4 import BeautifulSoup
>>> html_doc = """
... <HTML and CSS Stuff here>
... <div class="menu">
... <span class="author">Bob</span>
... <span class="smaller">(06 Jul at 09:21)</span>
... <br/>This message is very important to extract along with the matching author and time of submit<br/>
... </div>
... """
>>> soup = BeautifulSoup(html_doc)
>>> soup.select('div.menu br')[0].next_sibling
u'This message is very important to extract along with the matching author and time of submit'

Putting it together with extracting all the data:

for menu in soup.select('div.menu'):
    author = menu.find('span', class_='author').get_text()
    time = menu.find('span', class_='smaller').get_text()
    data = menu.find('br').next_sibling

This produces:

>>> for menu in soup.select('div.menu'):
...     author = menu.find('span', class_='author').get_text()
...     time = menu.find('span', class_='smaller').get_text()
...     data = menu.find('br').next_sibling
...     print('Author: {}\nTime: {}\nData: {}'.format(author, time, data))
...
Author: Bob
Time: (06 Jul at 09:21)
Data: This message is very important to extract along with the matching author and time of submit
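If a page holds several such div blocks, the same idea can be collected into a list of records. The following is only a sketch, assuming Python 3 and html.parser; the second entry ("Alice") and the shortened messages are made-up sample data, and instead of .next_sibling it gathers the div's direct text children, which is an alternative to the approach above rather than a replacement for it.

```python
from bs4 import BeautifulSoup

# Sample input; the second div is invented purely for illustration.
html_doc = """
<div class="menu">
<span class="author">Bob</span>
<span class="smaller">(06 Jul at 09:21)</span>
<br/>First message<br/>
</div>
<div class="menu">
<span class="author">Alice</span>
<span class="smaller">(06 Jul at 09:25)</span>
<br/>Second message<br/>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

records = []
for menu in soup.select("div.menu"):
    author = menu.find("span", class_="author")
    timestamp = menu.find("span", class_="smaller")
    # Direct-child text nodes only, so the text inside the spans is not included.
    pieces = [s.strip() for s in menu.find_all(string=True, recursive=False)]
    records.append({
        "author": author.get_text(strip=True) if author else None,
        "time": timestamp.get_text(strip=True) if timestamp else None,
        "data": " ".join(p for p in pieces if p),
    })

for record in records:
    print("Author: {author}\nTime: {time}\nData: {data}".format(**record))
```

Checking for a missing span before calling get_text avoids an AttributeError on blocks that lack an author or a timestamp.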

Getting text from br tags in BeautifulSoup
