Extract text between tags using a specific word

Time: 2017-07-26 20:12:29

Tags: html python-2.7 web-scraping beautifulsoup

I am trying to extract the text between tags on an HTML page, using a keyword to locate it. Here is an example.

<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
   ABC University </p>

I want to scrape "PhD, 2017, Subject, ABC University". This is what I have tried:

import re
import requests
from bs4 import BeautifulSoup

r = requests.get(site)
soup = BeautifulSoup(r.content, "lxml")
for elems in soup(text=re.compile('PhD')):
    val = elems.find_parent('p').getText()

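This prints every 'p' tag that contains "PhD". Can someone suggest how I can get only the one under the "Education" heading? I also tried using partition, but that did not work either.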

2 answers:

Answer 0: (score: 1)

You can try lxml.html to get the required text:

import requests
import lxml.html as html

source = requests.get(site).content
html_obj = html.fromstring(source)
# Join the stripped text nodes of the <p> siblings that follow the "Education" <h4>
my_text = " ".join([text.strip() for text in html_obj.xpath('//h4[.="Education"]/following-sibling::p/text()')])
print(my_text)

Output:

PhD, 2017, Subject, ABC University
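
If the page has several h4 sections, `following-sibling::p` matches every later `p` sibling, not just the one directly under the heading. Below is a minimal sketch of one way to narrow it down, assuming the nearest paragraph is the one wanted. The `SAMPLE` markup and the `text_after_heading` helper are made up for illustration; `p[1]` and the `$h` XPath variable are standard XPath/lxml features.

```python
import lxml.html as html

# Hypothetical sample page with two <h4> sections (not taken from a real site).
SAMPLE = """<div class="xyz">Title</div>
<h4>Not Education</h4>
<p>PhD, 2016, Subject,<br />
   DEF University </p>
<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
   ABC University </p>"""


def text_after_heading(source, heading):
    """Return the text of the first <p> sibling after the matching <h4>."""
    tree = html.fromstring(source)
    # $h is an XPath variable filled from the keyword argument below;
    # p[1] keeps only the nearest following sibling paragraph.
    parts = tree.xpath(
        '//h4[normalize-space(.)=$h]/following-sibling::p[1]/text()',
        h=heading,
    )
    # Strip each text node and drop the ones that were whitespace only.
    return " ".join(part.strip() for part in parts if part.strip())


print(text_after_heading(SAMPLE, "Education"))
print(text_after_heading(SAMPLE, "Not Education"))
```

Passing the heading as an XPath variable avoids building the expression by string concatenation, which would break on headings that contain quotes.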

Answer 1: (score: 1)

With BeautifulSoup, you can do:

import bs4 as bs
text = """<div class="xyz">Title</div>
    <h4>Not Education</h4>
    <p>PhD, 2016, Subject,<br />
     DEF University </p>
    <div class="xyz">Title</div>
    <h4>Education</h4>
    <p>PhD, 2017, Subject,<br />
     ABC University </p>"""

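soup = bs.BeautifulSoup(text, "lxml")
# Find the <h4> whose text is exactly "Education" ("Not Education" does not match)
header = soup.find('h4', text='Education')
# Take the first <p> sibling that follows that heading
val = header.find_next_sibling('p').getText()
print(val)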

Output:

PhD, 2017, Subject,
     ABC University
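
The output above keeps the line break and indentation that follow the `<br />`. If a single-line result such as "PhD, 2017, Subject, ABC University" is wanted, one possible approach is to join the paragraph's `stripped_strings`. This is only a sketch: the `text_after_heading` helper and the `SAMPLE` markup are illustrative names, and it uses the built-in `html.parser` rather than `lxml` so that it needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the snippet in the question.
SAMPLE = """<div class="xyz">Title</div>
<h4>Not Education</h4>
<p>PhD, 2016, Subject,<br />
   DEF University </p>
<div class="xyz">Title</div>
<h4>Education</h4>
<p>PhD, 2017, Subject,<br />
   ABC University </p>"""


def text_after_heading(markup, heading):
    """Return the first <p> after the given <h4> as one line, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # string= matches the heading text exactly, so "Not Education" is skipped.
    header = soup.find("h4", string=heading)
    if header is None:
        return None
    paragraph = header.find_next_sibling("p")
    if paragraph is None:
        return None
    # stripped_strings yields each text fragment with surrounding
    # whitespace removed, so the newline after <br /> disappears.
    return " ".join(paragraph.stripped_strings)


print(text_after_heading(SAMPLE, "Education"))
```

Returning None when the heading or paragraph is missing avoids the AttributeError that chaining `.find_next_sibling()` onto a failed `find()` would raise.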