Question

I have the following block:


<div class="text">
  <h1>Headerh1</h1>
   Text1 <br/> after header1 
  <h3>Headerh3.1</h3> 
     Text2 <br/> after header3.1 
  <h3>Headerh3.2</h3>
    Text3 <br/> after header3.2 
  <h3>Headerh3.3</h3>
    Text4 <br/> after header3.3 
</div>


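How can I get the text after the first h1 while ignoring the <br/>? The expected result is "Text1 after header1". The <br> may occur zero or more times.

I tried the following, but it returns the text after all of the headers:

//div[@class='text']/text()[count(preceding-sibling::h1)=1]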

Answer 1

Try the XPath below. It should return all text nodes that precede the first h3 inside the div:

//div[@class='text']/h3[1]/preceding-sibling::text()
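For readers without a full XPath 1.0 engine at hand, here is a minimal sketch of the same idea in Python using only the standard library. Note that `xml.etree.ElementTree` does not support the `preceding-sibling` axis or `text()`, so this walks the children of the div by hand instead of evaluating the expression above. It assumes the markup is well-formed XML, as the sample is. The names `SNIPPET` and `text_after_first_h1` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample block from the question (well-formed XML).
SNIPPET = """<div class="text">
  <h1>Headerh1</h1>
   Text1 <br/> after header1
  <h3>Headerh3.1</h3>
     Text2 <br/> after header3.1
</div>"""


def text_after_first_h1(xml_source):
    """Return the text between the first h1 and the next non-br element."""
    root = ET.fromstring(xml_source)
    # The div may be the root element itself or a descendant of it.
    if root.get("class") == "text":
        div = root
    else:
        div = root.find(".//div[@class='text']")
    parts = []
    seen_h1 = False
    for child in div:
        if not seen_h1:
            if child.tag == "h1":
                seen_h1 = True
                # Text that follows an element is stored in its .tail
                parts.append(child.tail or "")
            continue
        if child.tag != "br":
            break  # the next heading ends the section
        parts.append(child.tail or "")  # text after each <br/>
    # Collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parts).split())


print(text_after_first_h1(SNIPPET))  # Text1 after header1
```

Because the loop skips any number of `br` elements, it also covers the case where there is no `<br/>` at all.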

Answer 2

I assume this is an HTML file in your directory named demo.html.

from bs4 import BeautifulSoup

# Read demo.html; the with block closes the file automatically
with open("demo.html") as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Text of the h1 heading itself
h1 = soup.find('h1').text
# Text of every h3 heading
h3 = [i.text for i in soup.find_all('h3')]

Under Python 2 the output will be unicode strings, for example:

h3 = [u'Headerh3.1', u'Headerh3.2', u'Headerh3.3']

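To convert them to plain byte strings in Python 2, encode them as shown here. In Python 3, .text already returns str, so this step is unnecessary.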

h3 = [i.text.encode('utf-8') for i in soup.findAll('h3')]
h1 = soup.find('h1').text.encode('utf-8')
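The snippets in this answer collect the text of the headings themselves. To get the text that follows the first h1, as the question asks, one possible approach with BeautifulSoup is to walk the siblings of the h1 until the next non-br tag. This is only a sketch. The names `HTML` and `text_after_h1_bs4` are made up for illustration, and the sample is inlined rather than read from demo.html.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the sample block from the question.
HTML = """<div class="text">
  <h1>Headerh1</h1>
   Text1 <br/> after header1
  <h3>Headerh3.1</h3>
     Text2 <br/> after header3.1
</div>"""


def text_after_h1_bs4(html):
    """Collect the text after the first h1, skipping any number of br tags."""
    soup = BeautifulSoup(html, "html.parser")
    h1 = soup.find("h1")
    if h1 is None:
        return ""
    parts = []
    for node in h1.next_siblings:
        if isinstance(node, NavigableString):
            stripped = node.strip()
            if stripped:
                parts.append(stripped)
        elif node.name != "br":
            break  # any tag other than br ends the section
    return " ".join(parts)


print(text_after_h1_bs4(HTML))  # Text1 after header1
```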

XPath to get text only after the first HTML tag
