Question

I have the following HTML:

<div class="date_on_by"> 
<a sasource="qp_focused" href="/author/bill-maurer/articles">Bill Maurer</a>  
<span class="bullet">•</span> Yesterday, 9:33 AM 
<span class="bullet">•</span> 
<span class="comments">98&nbsp;Comments</span> 
</div>

If I call getText() on the result of text.find_all('div', class_="date_on_by"), it returns:

Bill Maurer • Yesterday, 9:33 AM • 98 Comments

But what I actually want is just:

Yesterday, 9:33 AM

that is, the text that is not inside any child element. How can I do that?

Answer 1

I figured it out!

import re
dates = []
for date in text.find_all('div', class_="date_on_by"):  # the date is the text between the first two bullets
    dates.append(re.split(date.find('span', class_="bullet").getText(), date.getText())[1].strip())

Answer 2

You can use the span's class name together with next_sibling:

In [9]: h = """<div class="date_on_by">
   ...: <a sasource="qp_focused" href="/author/bill-maurer/articles">Bill Maurer</a>
   ...: <span class="bullet">•</span> Yesterday, 9:33 AM
   ...: <span class="bullet">•</span>
   ...: <span class="comments">98&nbsp;Comments</span>
   ...: </div>"""

In [10]: from bs4 import BeautifulSoup

In [11]: soup = BeautifulSoup(h, "html.parser")

In [12]: print(soup.select_one("div.date_on_by span.bullet").next_sibling.strip())
Yesterday, 9:33 AM
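
As a complementary sketch (not part of the original answer), the text that belongs directly to the div, and not to any child tag, can probably also be collected with find_all(string=True, recursive=False). The helper name direct_text is made up for illustration, and html.parser is assumed as the parser:

```python
from bs4 import BeautifulSoup

html = """<div class="date_on_by">
<a sasource="qp_focused" href="/author/bill-maurer/articles">Bill Maurer</a>
<span class="bullet">•</span> Yesterday, 9:33 AM
<span class="bullet">•</span>
<span class="comments">98&nbsp;Comments</span>
</div>"""


def direct_text(tag):
    """Hypothetical helper: non-empty text nodes that are direct children of tag."""
    # recursive=False restricts the search to direct children, so text
    # inside <a> or <span> is skipped; string=True matches text nodes only.
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return [p for p in pieces if p]


soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="date_on_by")
result = direct_text(div)
print(result)  # ['Yesterday, 9:33 AM']
```

Unlike next_sibling, this does not depend on which child tag precedes the text, but it returns every direct text node, so a div with several loose text fragments would yield several items.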

Also, if you only want the first element, you should use .find instead of find_all(..)[0].
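
A small sketch of the difference, using made-up sample markup and assuming html.parser: both calls give the same first element when there is a match, but .find returns None when nothing matches, whereas find_all(..)[0] raises IndexError.

```python
from bs4 import BeautifulSoup

# Made-up sample markup for illustration.
soup = BeautifulSoup(
    '<p><span class="bullet">•</span> one <span class="bullet">•</span> two</p>',
    "html.parser",
)

# With a match, both approaches return the first matching tag.
first_via_find = soup.find("span", class_="bullet")
first_via_index = soup.find_all("span", class_="bullet")[0]
same = first_via_find is first_via_index

# Without a match, .find returns None ...
missing = soup.find("span", class_="no-such-class")

# ... while indexing the empty result of find_all raises IndexError.
try:
    soup.find_all("span", class_="no-such-class")[0]
    raised = False
except IndexError:
    raised = True
```

Checking the result of .find against None before using it is usually simpler than catching the IndexError.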

BeautifulSoup: selecting the parent's content rather than the children's content
