I'm trying to use a keyword to extract the text between tags in an HTML page. Here is an example:
<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
ABC University </p>
I want to extract "PhD, 2017, Subject, ABC University". This is what I have tried:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get(site)
soup = BeautifulSoup(r.content, "lxml")
for elems in soup(text=re.compile('PhD')):
    val = elems.find_parent('p').getText()
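This prints every 'p' tag that contains "PhD". Can someone suggest how I can get only the one under the "Education" heading? I also tried using partition, but that did not give a usable result.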
Answer 0 (score: 1)
You can try using lxml.html to get the required text:
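import requests
import lxml.html as html

# site is the page URL from the question
source = requests.get(site).content
html_obj = html.fromstring(source)
# Collect the text nodes of the <p> siblings that follow the "Education" <h4>
my_text = " ".join([text.strip() for text in html_obj.xpath('//h4[.="Education"]/following-sibling::p/text()')])
print(my_text)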
Output:
PhD, 2017, Subject, ABC University
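One caveat with this XPath: following-sibling::p matches every later <p> sibling, not only the one directly under the heading. The sketch below is a possible variation that adds a [1] predicate to keep just the first following <p>. The inline HTML string and its "Awards" section are made up for illustration, so the example runs without a network request.

```python
import lxml.html as html

# Hypothetical page fragment: the "Awards" section is invented for this example
source = """<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
ABC University </p>
<h4>Awards</h4>
<p>Best Paper, 2018</p>"""

html_obj = html.fromstring(source)

# [1] restricts the match to the first <p> sibling after the "Education" <h4>
parts = html_obj.xpath('//h4[.="Education"]/following-sibling::p[1]/text()')
my_text = " ".join(part.strip() for part in parts)
print(my_text)
```

Without the [1] predicate, the text of the "Awards" paragraph would be joined into the result as well.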
Answer 1 (score: 1)
With BeautifulSoup, you can do:
import bs4 as bs
text = """<div class="xyz">Title</div>
<h4>Not Education</h4>
<p>PhD, 2016, Subject,<br />
DEF University </p>
<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
ABC University </p>"""
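soup = bs.BeautifulSoup(text, "lxml")
# Find the <h4> whose text is exactly "Education"
header = soup.find('h4', text='Education')
# Take the first <p> sibling that follows that heading
val = header.find_next_sibling('p').getText()
print(val)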
Output:
PhD, 2017, Subject,
ABC University
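The output above keeps the line break that followed the <br /> tag. If the single-line string from the question is wanted, one option is to collapse the whitespace after extracting the text. This is a minimal sketch, assuming the built-in "html.parser" backend (swapped in here so lxml is not required) and the string= argument, which newer BeautifulSoup releases accept in place of text=.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
ABC University </p>"""

# "html.parser" is the standard-library backend; the answer above used "lxml"
soup = BeautifulSoup(html_doc, "html.parser")
header = soup.find("h4", string="Education")
paragraph = header.find_next_sibling("p")

# split() with no arguments breaks on any run of whitespace, including newlines,
# so joining with a single space yields a one-line string
val = " ".join(paragraph.get_text().split())
print(val)
```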