Question

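Basically, I want to automatically scrape all the text in the p tags that sit between h3 tags, keeping each section separate.
How do I write code that captures all the text between the h3 tags?
For example, the following two paragraphs: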

PARAGRAPH 1:

<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
    <p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>

PARAGRAPH 2:

<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">

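from the text below. I want to write code that does this without hard-coding which lines contain the p tags.

That way it can grab the text between the h3 headings automatically and directly, and of course it should also work on other pages, not just this one.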

<h3>Stage I cancers</h3>
<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
<p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">

How should I do this?

Answer 1

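If the text is already in a variable, use from bs4 import BeautifulSoup and run the code below. If you are instead navigating to a website and scraping the page, it is slightly different: you also need import requests, a variable url = 'whatever website', then page = requests.get(url), and finally, in place of the soup line below, soup = BeautifulSoup(page.text, 'lxml'). Keep the find variable and the for loop. All of this assumes you only want to capture the <h3> tags on the page.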

from bs4 import BeautifulSoup

html = """<h3>Stage I cancers</h3><p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href='/cancer/small-cell-lung-cancer/treating/surgery.html'>surgery</a> to remove the tumor and the nearby lymph nodes.</p><p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p><h3>Other limited stage cancers</h3><p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href='/cancer/small-cell-lung-cancer/treating/chemotherapy.html'>"""
# 'lxml' requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
# Collect every <h3> tag on the page
find = soup.find_all('h3')
for h3 in find:
    # Print only the heading text, without the tags
    print(h3.text)
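
For the website case described above, a minimal sketch might look like the following. The function names h3_texts and h3_texts_from_url are made up for illustration, the timeout value is an arbitrary choice, and 'html.parser' is swapped in for 'lxml' only so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup


def h3_texts(html):
    """Return the text of every <h3> tag found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.find_all("h3")]


def h3_texts_from_url(url):
    """Download a page with requests and return its <h3> texts."""
    # Imported here so the parsing helper above works without requests installed.
    import requests

    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raise an exception on HTTP error responses
    return h3_texts(page.text)
```

Calling h3_texts(html) on the string above should give the two heading texts as a list. h3_texts_from_url(url) does the same for a live page.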

Answer 2

:mergeReleaseResources
[string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.x
ml      [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\strings.x
ml: Error: Duplicate resources
[string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\facebookconnect
.xml    [string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\strings
.xml: Error: Duplicate resources
:mergeReleaseResources FAILED


BUILD FAILED
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':mergeReleaseResources'.
> [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\facebookconnect
.xml    [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\strings.x
ml: Error: Duplicate resources
[string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\facebookconnect
.xml    [string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\strings
.xml: Error: Duplicate resources

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug
option to get more log output.

Total time: 13.303 secs
cmd: Command failed with exit code 1 Error output:
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':mergeReleaseResources'.
> [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\facebookconnect
.xml    [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\strings.x
ml: Error: Duplicate resources
[string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\facebookconnect
.xml    [string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\strings
.xml: Error: Duplicate resources

Answer 3

Use find_next_sibling()

from bs4 import BeautifulSoup

html = '''<h3>Stage I cancers</h3>
<p>If you only have one small tumoremove</p>
<p>People who arent healthy enough.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC</p>'''

soup = BeautifulSoup(html, 'html.parser')
for section in soup.find_all('h3'):
    nextNode = section
    print("=================== %s ===================" % section.text)
    while True:
        # Step to the next sibling tag after the current node
        nextNode = nextNode.find_next_sibling()
        if nextNode and nextNode.name == 'p':
            print(nextNode)
        else:
            # Stop at the first non-<p> tag or at the end of the document
            print("-------------------- h3 end --------------------\n")
            break
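
If you would rather collect the text than print it, the same sibling walk can fill a dictionary. This is only a sketch: the function name paragraphs_by_h3 is hypothetical, and it assumes, as in the loop above, that a section ends at the first sibling tag that is not a <p>.

```python
from bs4 import BeautifulSoup


def paragraphs_by_h3(html):
    """Map each <h3> heading text to the texts of the <p> tags right after it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for h3 in soup.find_all("h3"):
        paragraphs = []
        # With no arguments, find_next_siblings() returns the following sibling tags in order.
        for sibling in h3.find_next_siblings():
            if sibling.name != "p":
                break  # the section ends at the first tag that is not a <p>
            paragraphs.append(sibling.get_text().strip())
        sections[h3.get_text(strip=True)] = paragraphs
    return sections
```

With the html string from this answer, the "Stage I cancers" key should hold the first two paragraphs, and the "also Skip this" paragraph after the <h2> is left out. Note that two headings with identical text would share one key, so the later section would overwrite the earlier one.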

BeautifulSoup: scrape the <p> tags between <h3> tags

3 answers: