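So basically, I want to automatically scrape all the text inside the <p> tags that sit between <h3> tags, and the number of paragraphs varies.
How can I write code that captures all the text between one <h3> and the next?
For example, I want to extract these two groups: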
PARAGRAPH 1:
<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
<p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>
PARAGRAPH 2:
<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">
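Both groups come from the HTML below. I want code that is not hard-coded, for example by specifying which lines contain the <p> tags.
It should grab the text between the <h3> headings automatically and directly, and of course it should also work on other pages, not just this one.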
<h3>Stage I cancers</h3>
<p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href="/cancer/small-cell-lung-cancer/treating/surgery.html">surgery</a> to remove the tumor and the nearby lymph nodes.</p>
<p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href="/cancer/small-cell-lung-cancer/treating/chemotherapy.html">
What should I do?
Answer 0 (score: 0)
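If the text is already in a variable, add from bs4 import BeautifulSoup and run the code below. If you are instead fetching the page from a website, it is slightly different: you also need import requests, a variable for the address (url = 'whatever website'), and then page = requests.get(url). Finally, replace the soup line in the code below with soup = BeautifulSoup(page.text, 'lxml'). Keep the find variable and the for loop. All of this assumes you only want to capture every <h3> tag on the page.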
from bs4 import BeautifulSoup

html = """<h3>Stage I cancers</h3><p>If you only have one small tumor in your lung and there is no evidence of cancer in lymph nodes or elsewhere, your doctors may recommend <a href='/cancer/small-cell-lung-cancer/treating/surgery.html'>surgery</a> to remove the tumor and the nearby lymph nodes.</p><p>People who aren’t healthy enough for chemoradiation are usually treated with chemo by itself. This may be followed by radiation to the chest.</p><h3>Other limited stage cancers</h3><p>For most people with limited stage SCLC, surgery is not an option because the tumor is too large, it’s in a place that can’t be removed easily, or it has spread to nearby lymph nodes or other places in the lung. If you are in good health, the standard treatment is <a href='/cancer/small-cell-lung-cancer/treating/chemotherapy.html'>"""

# The 'lxml' parser needs the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
# Collect every <h3> tag and print its text
find = soup.find_all('h3')
for h3 in find:
    print(h3.text)
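The live-page variant described in this answer could be sketched roughly as follows. The function names heading_texts and heading_texts_from_url are hypothetical, and the built-in 'html.parser' is swapped in for 'lxml' only to avoid the extra dependency; treat this as a minimal sketch, not a definitive implementation.

```python
from bs4 import BeautifulSoup


def heading_texts(html, tag="h3"):
    """Return the text of every `tag` element found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text().strip() for h in soup.find_all(tag)]


def heading_texts_from_url(url, tag="h3"):
    """Download a page with requests and return its heading texts."""
    # Third-party package; imported here so heading_texts() works without it.
    import requests

    page = requests.get(url, timeout=10)
    page.raise_for_status()  # raise an exception on 4xx/5xx responses
    return heading_texts(page.text, tag)


# Offline check with a small inline document
print(heading_texts("<h3>Stage I cancers</h3><p>text</p><h3>Other limited stage cancers</h3>"))
```

For a real page you would call heading_texts_from_url with the address you want to scrape. Keeping the parsing in its own function means the same code handles HTML that is already in a variable.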
Answer 2 (score: 0)
from bs4 import BeautifulSoup

html = '''<h3>Stage I cancers</h3>
<p>If you only have one small tumoremove</p>
<p>People who arent healthy enough.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>For most people with limited stage SCLC</p>'''

soup = BeautifulSoup(html, 'html.parser')

for section in soup.find_all('h3'):
    nextNode = section
    print("=================== %s ===================" % section.text)
    # Walk the tags after this <h3>; stop at the first one that is not a <p>
    while True:
        nextNode = nextNode.find_next_sibling()
        if nextNode and nextNode.name == 'p':
            print(nextNode)
        else:
            print("-------------------- h3 end --------------------\n")
            break
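If you want the paragraph text rather than the printed tags, the same sibling walk can be wrapped in a function that returns a dictionary. This is only a sketch under the same stopping rule as above (a section ends at the first tag that is not a <p>); the name paragraphs_by_heading and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def paragraphs_by_heading(html, heading="h3"):
    """Map each heading's text to the text of the <p> tags directly after it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for section in soup.find_all(heading):
        texts = []
        # find_next_siblings() returns only tags, in document order
        for node in section.find_next_siblings():
            if node.name != "p":
                break  # the first non-<p> tag ends this section
            # get_text() also includes the text of nested tags such as <a>
            texts.append(node.get_text().strip())
        sections[section.get_text().strip()] = texts
    return sections


sample = """<h3>Stage I cancers</h3>
<p>First <a href="#">linked</a> paragraph.</p>
<p>Second paragraph.</p>
<h2>Skip this</h2>
<p>also Skip this</p>
<h3>Other limited stage cancers</h3>
<p>Third paragraph.</p>"""

for title, paragraphs in paragraphs_by_heading(sample).items():
    print(title, paragraphs)
```

Nothing here is tied to one page, so the function can be reused on other documents as long as the headings and paragraphs are siblings. If they are wrapped in different parent elements, the sibling walk will not see them and a different traversal would be needed.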