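I am using beautifulsoup4 with Python to scrape content from a web page, and I am trying to extract the content of specific HTML tags while ignoring everything else.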
I have the following HTML:
<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>
My goal is to understand how to tell Python to get only the <p> elements from the parent <div class="the-one-i-want">, and to ignore any <div> nested inside it.
Currently, I find the contents of the parent div with:
content = soup.find('div', class_='the-one-i-want')
However, I can't figure out how to narrow this down further so that only the <p> tags are extracted from it, without getting an error.
Answer 0 (score: 2)
h = """<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>"""
Once you have found the div, you can call find_all("p") on it:
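from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(h, "html.parser")
# Find the parent div by its class, then collect every <p> inside it.
print(soup.find("div", "the-one-i-want").find_all("p"))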
Or use a CSS selector with select:
print(soup.select("div.the-one-i-want p"))
Both will give you:
[<p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>]
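If the text is what is needed rather than the tag objects, each result can be passed through get_text. This is a minimal sketch, assuming a shortened copy of the markup above; the variable names html, paragraphs and texts are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative copy of the markup from the question.
html = """<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
</div>
<p>
"random text content here and about"
</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find("div", class_="the-one-i-want").find_all("p")

# get_text(strip=True) drops the newlines and whitespace around the text.
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)
```

Each entry in texts is then a plain string, which is usually easier to store or process than a Tag.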
find_all only finds descendants of the div with the class the-one-i-want, and the same applies to our select.
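Because both calls match all descendants, they would also return any <p> that sits inside the unwanted nested div. The sample markup has none, so it makes no difference there. If only the direct children of the parent are wanted, the following sketch shows two ways to restrict the search; the markup here is a hypothetical variation with a <p> placed inside the nested div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the nested div contains a <p> that should be skipped.
html = """<div class="the-one-i-want">
<p>first</p>
<div class="random-inserted-element-i-dont-want"><p>unwanted</p></div>
<p>second</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# recursive=False limits the search to direct children of the parent div.
direct = parent.find_all("p", recursive=False)

# The CSS child combinator ">" expresses the same restriction.
direct_css = soup.select("div.the-one-i-want > p")

direct_texts = [p.get_text(strip=True) for p in direct]
css_texts = [p.get_text(strip=True) for p in direct_css]
print(direct_texts)
print(css_texts)
```

Both lists contain only the text of the two top-level paragraphs; the paragraph inside the nested div is left out.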