Question

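I'm using beautifulsoup4 with Python to scrape content from a web page, and I'm trying to extract the content of specific HTML tags while ignoring everything else.

I have the following HTML: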

<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>

My goal is to understand how to tell Python to get only the <p> elements from within the parent <div class="the-one-i-want">, and otherwise ignore all the <div> elements inside it.

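Currently, I find the contents of the parent div like this:

content = soup.find('div', class_='the-one-i-want')

However, I can't seem to figure out how to further specify that only the <p> tags should be extracted from it without getting errors.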

Answer 1

h = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want">
        <content>
    </div>
    <p>
        "random text content here and about"
    </p>
    <p>
        "random text content here and about"
    </p>
</div>"""

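Once you have found the div, you can call find_all("p") on it:

from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(h, "html.parser")

# Find the parent div by its class, then collect the <p> tags inside it
print(soup.find("div", "the-one-i-want").find_all("p"))

Or use a CSS selector: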

print(soup.select("div.the-one-i-want p"))

Both will give you:

[<p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>, <p>\n        "random text content here and about"\n    </p>]
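If only the text is needed rather than the tags themselves, something like the following should work. This is a minimal sketch: the shortened `html` string and the `texts` name are my own, and the `<span>` inside the unwanted div is made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the markup from the question
html = """<div class="the-one-i-want">
    <p>
        "random text content here and about"
    </p>
    <div class="random-inserted-element-i-dont-want"><span>skip me</span></div>
    <p>
        "random text content here and about"
    </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) drops the surrounding newlines and indentation
texts = [p.get_text(strip=True) for p in soup.select("div.the-one-i-want p")]
print(texts)
```

The text of the unwanted div never shows up here because it contains no `<p>` tags.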

find_all only finds descendants of the div with the class the-one-i-want, and the same applies to the select call.
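One caveat: "descendants" includes `<p>` tags nested inside the unwanted inner div, if it ever contains any. In that case, restricting the search to direct children may be closer to what you want. A rough sketch, assuming a made-up variant of the markup where the inner div does contain a `<p>` (the variable names are mine):

```python
from bs4 import BeautifulSoup

# Hypothetical variant: the unwanted div now holds a <p> of its own
html = """<div class="the-one-i-want">
    <p>outer one</p>
    <div class="random-inserted-element-i-dont-want">
        <p>nested, unwanted</p>
    </div>
    <p>outer two</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# Default search walks all descendants, so the nested <p> is included
all_p = parent.find_all("p")

# recursive=False looks at direct children of the parent only
direct_p = parent.find_all("p", recursive=False)

# The CSS child combinator ">" expresses the same restriction
direct_css = soup.select("div.the-one-i-want > p")

print(len(all_p))
print([p.get_text() for p in direct_p])
print([p.get_text() for p in direct_css])
```

With the markup from the question itself, the inner div has no `<p>` tags, so all of these approaches return the same five elements.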
