BeautifulSoup: scraping <p> tags between <h3> tags

Time: 2018-11-08 17:52:24

Tags: html web-scraping beautifulsoup web-crawler

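Basically, I want to automatically scrape all the text in the <p> tags that sit between different <h3> tags.
How can I write code that captures all the text between the h3 tags?
For example, the following two paragraphs: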

PARAGRAPH 1:

<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
    <p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>

PARAGRAPH 2:

<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">

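Both come from the HTML shown further down. I want to write code that does this without hard-coding, for example without specifying which lines contain the p tags.

That way it can automatically grab the text between the h3 headings directly and, of course, can be used on other pages as well, not just this one.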

<h3>Stage I cancers</h3>
<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
<p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">

How should I go about this?

3 answers:

Answer 0 (score: 0)

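If the text is already in a variable, use from bs4 import BeautifulSoup and run the code below. If you are instead trying to navigate to a website and scrape the page, it is slightly different: you need to import requests, add a variable url = 'whatever website', then a variable page = requests.get(url), and finally, in place of the soup line below, soup = BeautifulSoup(page.text, 'lxml'). Keep the find variable and the for loop. All of this assumes you are only trying to capture all of the <h3> tags on the page.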

from bs4 import BeautifulSoup

html = """<h3>Stage I cancers</h3><p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href='/cancer/small-cell-lung-cancer/treating/surgery.html'>surgery</a> to remove the tumor and the nearby lymph nodes.</p><p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p><h3>Other limited stage cancers</h3><p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href='/cancer/small-cell-lung-cancer/treating/chemotherapy.html'>"""
# Parse the HTML string (the 'lxml' parser requires the lxml package)
soup = BeautifulSoup(html, 'lxml')
# Collect every <h3> tag and print its text
find = soup.find_all('h3')
for h3 in find:
    print(h3.text)
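
For the fetched-page case described above, a minimal sketch might look like the following. The helper names h3_headings and fetch_h3_headings are made up for illustration, and html.parser is swapped in for lxml only so that the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup


def h3_headings(html_text):
    """Return the text of every <h3> tag in an HTML string."""
    soup = BeautifulSoup(html_text, 'html.parser')
    return [h3.get_text(strip=True) for h3 in soup.find_all('h3')]


def fetch_h3_headings(url):
    """Download a page and return its <h3> headings (hypothetical helper)."""
    # Imported here so h3_headings() works even without requests installed
    import requests

    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raise an exception on HTTP error statuses
    return h3_headings(page.text)


# Offline check on an inline string; no network access is needed for this part
sample = "<h3>Stage I cancers</h3><p>Some text.</p><h3>Other limited stage cancers</h3>"
print(h3_headings(sample))
```

Splitting the parsing from the download keeps the parsing part usable on HTML that is already in a variable, which is the first case the answer mentions.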

Answer 1 (score: 0)

:mergeReleaseResources
[string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.xml    [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\strings.xml: Error: Duplicate resources
[string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.xml    [string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\strings.xml: Error: Duplicate resources
:mergeReleaseResources FAILED


BUILD FAILED
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':mergeReleaseResources'.
> [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.xml    [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\strings.xml: Error: Duplicate resources
[string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.xml    [string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\strings.xml: Error: Duplicate resources

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

Total time: 13.303 secs
cmd: Command failed with exit code 1 Error output:
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':mergeReleaseResources'.
> [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.xml    [string/fb_app_id] C:\nodejs7\la3\platforms\android\res\values\strings.xml: Error: Duplicate resources
[string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\facebookconnect.xml    [string/fb_app_name] C:\nodejs7\la3\platforms\android\res\values\strings.xml: Error: Duplicate resources

Answer 2 (score: 0)

Use find_next_sibling()

from bs4 import BeautifulSoup

html = '''<h3>Stage I cancers</h3>
<p>If you only have one small tumoremove</p>
<p>People who arent healthy enough.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC</p>'''

soup = BeautifulSoup(html, 'html.parser')
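for section in soup.find_all('h3'):
    next_node = section
    print("=================== %s ===================" % section.text)
    while True:
        # find_next_sibling() returns the next sibling tag, skipping whitespace text nodes
        next_node = next_node.find_next_sibling()
        if next_node and next_node.name == 'p':
            print(next_node)
        else:
            # Any non-<p> tag, or running out of siblings, ends this section
            print("-------------------- h3 end --------------------\n")
            break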
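
To get the paragraph text itself rather than printed tags, the same sibling walk can collect the text under each heading. This is only a sketch and not part of the answer: the function name paragraphs_by_h3 is hypothetical, and the sample HTML is shortened and slightly altered to include a link. Like the loop above, it stops at the first sibling that is not a <p> tag.

```python
from bs4 import BeautifulSoup


def paragraphs_by_h3(html_text):
    """Map each <h3> heading to the text of the <p> tags directly after it."""
    soup = BeautifulSoup(html_text, 'html.parser')
    sections = {}
    for h3 in soup.find_all('h3'):
        texts = []
        # True matches any tag, so whitespace between tags is never returned
        for sibling in h3.find_next_siblings(True):
            if sibling.name != 'p':
                break  # any other tag (an <h2>, the next <h3>, ...) ends the section
            # Join nested strings (e.g. link text) with single spaces
            texts.append(sibling.get_text(' ', strip=True))
        sections[h3.get_text(strip=True)] = texts
    return sections


html = '''<h3>Stage I cancers</h3>
<p>First paragraph with a <a href="#">link</a> inside.</p>
<p>Second paragraph.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC</p>'''

for heading, texts in paragraphs_by_h3(html).items():
    print(heading, texts)
```

Because nothing here refers to a specific line or page, the same function can be pointed at other pages that use the h3-then-p layout. If two headings share the same text, the later one overwrites the earlier entry in the dictionary.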