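I am trying to extract all the <p> tags that appear after the <i> tag until an <h1> is reached, and then repeat the process for the next section.
Sample HTML: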
<h1><h1>
<p></p>
<i></i>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<h1><h1>
<p></p>
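The problem is that the number of <p> tags containing "Need to grab this" varies: sometimes there is one, sometimes three, or even eight.
How can I write a loop that captures all of them until the next <h1> tag?
At the moment I am using BeautifulSoup.
This is my current Python code: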
headernum = 0
i = 0
x = soup.find_all("h1")
for i in range(len(x)):
    header = soup.find_all('h1')[headernum]
    name = header.find_all_next('p')[1]
    print(name.text)
    workplace = name.find_all_next('i')[0]
    print(workplace.text)
    abstract = workplace.find_all_next('p')[1].get_text()
    print(abstract)
    i += 1
    headernum += 1
Answer 0 (score: 1)
You can iterate over the element.next_siblings iterator: given a starting element, loop over the siblings that follow it until you reach an end condition:
for elem in start.next_siblings:
    if elem.name == 'h1':
        break
    if elem.name != 'p':
        continue
    # it's a <p> tag before the next <h1>
    ...
Demo:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <h1><h1>
... <p></p>
... <i></i>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <p>Need to grab this</p>
... <h1><h1>
... <p></p>
... ''')
>>> start = soup.find('i')
>>> for elem in start.next_siblings:
...     if elem.name == 'h1':
...         break
...     if elem.name != 'p':
...         continue
...     print(elem)
...
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
<p>Need to grab this</p>
Combined with your existing code (slightly cleaned up):
for header in soup.find_all("h1"):
    name = header.find_next_siblings('p', limit=2)[-1]
    print(name.text)
    workplace = name.find_next_sibling('i')
    print(workplace.text)
    abstract = []
    for elem in name.next_siblings:
        if elem.name == 'h1':
            break
        if elem.name != 'p':
            continue
        # it's a <p> tag before the next <h1>
        abstract.append(elem.get_text())
    print('\n'.join(abstract))
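For reference, here is a minimal self-contained sketch of the same sibling-walking idea, wrapped in a helper so it can be run on its own. The helper name paragraphs_after_i and the sample markup (with properly closed <h1> tags and placeholder text) are assumptions made for illustration, not part of the original question. It also assumes every section has its own <i> tag; if one is missing, find_next_sibling would pick up the <i> of a later section.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the text content is made up for illustration.
HTML = """
<h1>First</h1>
<p>name</p>
<i>workplace</i>
<p>one</p>
<p>two</p>
<h1>Second</h1>
<p>name</p>
<i>workplace</i>
<p>three</p>
"""


def paragraphs_after_i(header):
    """Return the text of each <p> sibling found between the first <i>
    after `header` and the next <h1> (hypothetical helper)."""
    start = header.find_next_sibling("i")
    if start is None:
        return []
    texts = []
    for elem in start.next_siblings:
        if elem.name == "h1":
            break  # reached the next section
        if elem.name == "p":
            texts.append(elem.get_text())
    return texts


soup = BeautifulSoup(HTML, "html.parser")
groups = [paragraphs_after_i(h) for h in soup.find_all("h1")]
print(groups)
```

Whitespace between tags shows up in next_siblings as text nodes whose name is None, which is why the loop checks the name rather than assuming every sibling is a tag.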
Answer 1 (score: -1)
With XPath, this can be solved as follows:
//h1/following-sibling::p
This should give you all the p elements that are following siblings of an h1.
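As a rough sketch of how that expression behaves, the snippet below runs it with lxml, which is an assumption here since the answer does not name a library. It also shows one possible way to stop at the next h1: keep only the p siblings whose number of preceding h1 siblings matches the position of the current header. The sample markup is made up, and the grouping trick assumes all h1 tags share the same parent.

```python
import lxml.html

# Hypothetical sample markup with properly closed <h1> tags.
HTML = (
    "<html><body>"
    "<h1>First</h1><p>a</p><p>b</p>"
    "<h1>Second</h1><p>c</p>"
    "</body></html>"
)

doc = lxml.html.fromstring(HTML)

# The expression from the answer: every <p> that follows any <h1> as a
# sibling. It does not stop at the next <h1>.
all_p = [p.text_content() for p in doc.xpath("//h1/following-sibling::p")]

# Per-header grouping: for the n-th <h1>, keep the following <p> siblings
# that have exactly n <h1> siblings before them.
groups = []
for n, h1 in enumerate(doc.xpath("//h1"), start=1):
    ps = h1.xpath(
        "following-sibling::p[count(preceding-sibling::h1) = $n]", n=n
    )
    groups.append([p.text_content() for p in ps])

print(all_p)
print(groups)
```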