Question

How can I use BeautifulSoup in Python to extract all of the <p> tags from the following HTML? The code I am trying is shown below.

HTML code:

<div class="text_details">                  
<p>
Allah's Messenger (ﷺ) said: Islam is based on (the following) five (principles):
</p>
<p>  1. To testify that none has the right to be worshipped but Allah and Muhammad is Allah's Messenger (ﷺ).</p>
<p>  2. To offer the (compulsory congregational) prayers dutifully and perfectly.</p>
<p>  3. To pay Zakat (i.e. obligatory charity)</p>
<p>  4. To perform Hajj. (i.e. Pilgrimage to Mecca)</p>
<p>  5. To observe fast during the month of Ramadan.</p>
<p></p>
</div>

Code:

import requests
from bs4 import BeautifulSoup
url = "https://www.sunnah.com/bukhari/11"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for pp in soup.find_all(class_='text_details').p:
       print pp.text

Answer 1

You should use find (which returns a single tag) to get the div tag, and then find_all (which returns a list of tags) to get the p tags:

In [59]: for pp in soup.find(class_='text_details').find_all('p'):
    ...:     print(pp.text)
    ...:     
I heard Allah's Messenger (ﷺ) (p.b.u.h) saying, "We (Muslims) are the last (to come) but (will be) the 
foremost on the Day of Resurrection though the former nations were given the Holy Scriptures before 
us. And this was their day (Friday) the celebration of which was made compulsory for them but they 
differed about it. So Allah gave us the guidance for it (Friday) and all the other people are behind us in 
this respect: the Jews' (holy day is) tomorrow (i.e. Saturday) and the Christians' (is) the day after 
tomorrow (i.e. Sunday)."
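
As a minimal offline sketch of the same idea, the snippet below parses a small inline HTML string (made up for illustration, loosely modelled on the question's markup) instead of fetching the page, and uses the built-in html.parser so that lxml is not required. It also shows why the original attempt fails: find_all returns a list-like ResultSet, which has no .p attribute.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; the real page content will differ.
html = """
<div class="text_details">
<p>Islam is based on five principles:</p>
<p>  3. To pay Zakat (i.e. obligatory charity)</p>
<p>  4. To perform Hajj. (i.e. Pilgrimage to Mecca)</p>
<p></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, so attribute access such as .p fails.
try:
    soup.find_all(class_="text_details").p
    failed = False
except AttributeError:
    failed = True

# find() returns the first matching tag; find_all("p") then lists its <p> tags.
container = soup.find(class_="text_details")
texts = [p.get_text(strip=True) for p in container.find_all("p")]
texts = [t for t in texts if t]  # drop the empty trailing <p></p>

for t in texts:
    print(t)
```

Note that find returns None when nothing matches, so on a real page it may be worth checking the result before calling find_all on it.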

The div tag contains only p tags, so you can get all of the text at once with:

In [60]: soup.find(class_='text_details').text
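
A small sketch of reading the text straight from the container, again on a made-up inline HTML string and with html.parser assumed as the parser. Plain .text keeps the whitespace and newlines between tags, so get_text with a separator and strip=True is used here to get one clean line per paragraph.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; not the content of the real page.
html = """
<div class="text_details">
<p>Intro line.</p>
<p>  1. First point.</p>
<p></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(class_="text_details")

# strip=True trims each text fragment and skips the empty ones;
# the separator is placed between the fragments that remain.
all_text = container.get_text(separator="\n", strip=True)
print(all_text)
```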

Answer 2

You can use select to grab every p that is a direct child of the text_details div.

Like this:

import requests
from bs4 import BeautifulSoup

url = "https://www.sunnah.com/bukhari/11"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for pp in soup.select("div.text_details > p"):
    print(pp.text)
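
To illustrate what the ">" in that selector does, here is a short sketch on a made-up HTML string (the nested blockquote is an assumption added for the demonstration, and html.parser is used instead of lxml). The child combinator matches only p tags directly under the div, while a plain space matches p tags at any depth.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; the blockquote is added to show the difference.
html = """
<div class="text_details">
<p>Direct child one</p>
<blockquote><p>Nested paragraph</p></blockquote>
<p>Direct child two</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.text_details > p": only <p> tags that are direct children of the div.
direct = [p.get_text(strip=True) for p in soup.select("div.text_details > p")]

# "div.text_details p": every <p> inside the div, at any nesting depth.
nested = [p.get_text(strip=True) for p in soup.select("div.text_details p")]

print(direct)
print(nested)
```

For the markup in the question both selectors give the same result, because every p sits directly inside the div.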
