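I'm having trouble accessing the tags that come right after a heading tag (h1, h2, etc.).
Take this page:
https://www.w3schools.com/python/ref_string_split.asp
Say I want the text under the "Definition and Usage" heading,
`<h2>Definition and Usage</h2>`
How do I reference the <p> block directly below that line?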
Answer 0 (score: 0)
You can select the whole block of markup that the heading is nested in, then use .split() to cut out the part you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Raw string so the backslashes in the Windows path are not treated as escapes
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"
chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH
url = "Your Url" # Replace with your url
# Selenium 3 style call; Selenium 4 expects service=Service(CHROMEDRIVER_PATH) and options=chrome_options
browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=chrome_options)
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
browser.quit()
# Pass the string directly; str() on encoded bytes would produce a "b'...'" literal
soup = BeautifulSoup(innerHTML, 'lxml')  # the 'lxml' parser requires the lxml package
# Identify the enclosing tag that will contain your <h2> tags (before the p)
source = soup.find('name of tag containing h2')
# Keep what follows the heading, then take the contents of the first <p>...</p>
html = str(source).split("<h2>Definition and Usage</h2>")[1]
html = html.split("<p>")[1].split("</p>")[0]
# This should give you everything between the tags that you specified.
print(html)
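The string-splitting idea can be tried without a browser. The following is a minimal sketch using only the standard library; the `page` string is a shortened, hypothetical stand-in for the real page source, and `first_paragraph_after` is an illustrative helper name, not part of any library. It assumes the paragraph opens with a bare `<p>` (no attributes).

```python
# Hypothetical, shortened stand-in for the page source.
page = (
    "<h3>Example</h3>\n"
    "<p>Split a string into a list where each word is a list item:</p>\n"
    "<h2>Definition and Usage</h2>\n"
    "<p>The split() method splits a string into a list.</p>\n"
    "<p>You can specify the separator, default separator is any whitespace.</p>\n"
)


def first_paragraph_after(html, heading_html):
    """Return the inner HTML of the first <p> that follows heading_html, or None."""
    # Keep only what comes after the heading.
    _, found, rest = html.partition(heading_html)
    if not found:
        return None
    # Skip ahead to the first bare <p> tag.
    _, found, rest = rest.partition("<p>")
    if not found:
        return None
    # Cut at the closing tag of that paragraph.
    return rest.partition("</p>")[0]


result = first_paragraph_after(page, "<h2>Definition and Usage</h2>")
print(result)
```

Cutting at the heading first matters: an earlier `<p>` (here the one under "Example") would otherwise be picked up instead. Plain string matching is still fragile against attribute or whitespace changes, so a parser-based lookup is usually the safer choice.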
Answer 1 (score: 0)
Suppose you can reach <h2>Definition and Usage</h2> with
soup.find_all('h2')[7]
The p element is a sibling of that element, so it can be extracted with
soup.find_all('h2')[7].next_sibling.next_sibling
The result is
<p>The <code class="w3-codespan">split()</code> method splits a string into a list.</p>
Note: we call .next_sibling twice because the first sibling of <h2>Definition and Usage</h2> is a newline, i.e. \n
See the official Beautiful Soup documentation for more on usage.
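To see why two .next_sibling steps are needed, here is a small sketch on an inline snippet (a hypothetical, shortened version of the page markup) parsed with the built-in "html.parser". It also shows find_next_sibling("p"), which skips text nodes, and looks the heading up by its text rather than by a position such as [7], which may shift if the page changes.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup around the heading.
snippet = """<div>
<h2>Definition and Usage</h2>
<p>The split() method splits a string into a list.</p>
<h2>Syntax</h2>
</div>"""

soup = BeautifulSoup(snippet, "html.parser")
h2 = soup.find("h2", string="Definition and Usage")

first = h2.next_sibling                 # the newline text node right after </h2>
second = h2.next_sibling.next_sibling   # the <p> tag
direct = h2.find_next_sibling("p")      # skips text nodes, returns the same <p>

print(repr(first))
print(second)
print(direct is second)
```

If the markup had no newline between the heading and the paragraph, a single .next_sibling would already be the <p>, so find_next_sibling("p") is the less brittle of the two.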
Answer 2 (score: 0)
You can use find_next to get the next tag.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.w3schools.com/python/ref_string_split.asp")
soup=BeautifulSoup(res.text,"html5lib")
h2=soup.find("h2", string="Definition and Usage")
p_after_h2=h2.find_next("p")
p_text_after_h2=p_after_h2.text.replace("\n","")
print(p_after_h2)
print(p_text_after_h2)
Output
<p>The <code class="w3-codespan">split()</code> method splits a string into a
list.</p>
The split() method splits a string into a list.
The page's HTML is
...
<div class="w3-example">
<h3>Example</h3>
<p>Split a string into a list where each word is a list item:</p>
<div class="w3-code notranslate pythonHigh">
txt = "welcome to the jungle"<br><br>x = txt.split()<br><br>
print(x)</div>
<a target="_blank" class="w3-btn w3-margin-bottom" href="showpython.asp?filename=demo_ref_string_split">Run example »</a>
</div>
<hr>
<h2>Definition and Usage</h2>
<p>The <code class="w3-codespan">split()</code> method splits a string into a
list.</p>
<p>You can specify the separator, default separator is any whitespace.</p>
<div class="w3-panel w3-note">
<p><strong>Note:</strong> When max is specified, the list will contain the
specified number of elements <em>plus one</em>.</p>
</div>
<hr>
<h2>Syntax</h2>
<div class="w3-code w3-border notranslate">
<div>
<em>string</em>.split(<em>separator, max</em>)
</div>
</div>
...
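This is the text of our response. With
h2=soup.find("h2", string="Definition and Usage")
we get the h2 tag whose text is "Definition and Usage". Then we use
p_after_h2=h2.find_next("p")
to find the next p after that h2 tag. Finally we use
p_text_after_h2=p_after_h2.text.replace("\n","")
to get the text inside the p tag with the newlines removed.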
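If every paragraph under the heading is wanted rather than just the first, the same lookup can be extended. This is a sketch under assumptions: the snippet is a trimmed copy of the markup shown above, `paragraphs_under` is a hypothetical helper name, and the section is taken to end at the next heading of the same level. Whitespace is normalized with split/join, because replace("\n", "") can glue two words together when the newline is the only whitespace between them.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup around the heading.
snippet = """
<h2>Definition and Usage</h2>
<p>The <code class="w3-codespan">split()</code> method splits a string into a
list.</p>
<p>You can specify the separator, default separator is any whitespace.</p>
<hr>
<h2>Syntax</h2>
<p>Not part of the section above.</p>
"""


def paragraphs_under(soup, heading_text, heading_tag="h2"):
    """Collect the text of sibling <p> tags between a heading and the next one."""
    heading = soup.find(heading_tag, string=heading_text)
    if heading is None:
        return []
    texts = []
    # find_next_siblings() with no arguments yields tags only, not text nodes.
    for sibling in heading.find_next_siblings():
        if sibling.name == heading_tag:
            break  # the next heading starts a new section
        if sibling.name == "p":
            # Collapse newlines and repeated spaces into single spaces.
            texts.append(" ".join(sibling.get_text().split()))
    return texts


soup = BeautifulSoup(snippet, "html.parser")
section = paragraphs_under(soup, "Definition and Usage")
for text in section:
    print(text)
```

Only direct siblings are inspected, so a paragraph nested inside a wrapper such as the w3-panel div on the real page would be skipped by this sketch.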