Question

I need to extract the HTML tags, together with their text, from inside a single tag on a page. For example:

<html>
 <body>
  <div class="post">
   text <p> text </p> text <a> text </a>
   <span> text </span>
  </div>
  <div class="post">
   another text <p> text </p>
  </div>
 </body>
</html>

I need the HTML of the first <div class="post">:

text <p> text </p> text <a> text </a>
   <span> text </span>

that is, with the tags included.

With XPath I can only extract the text:

(//div[@class="post"])[1]/descendant-or-self::*[not(name()="script")]/text()

Result: text text text text text

I also tried:

(//div[@class="post_body"])[1]/node()

but I don't know how to build a string from the nodes it returns.

P.S. Suggestions for another approach (for example, BeautifulSoup) are also welcome. Please help.

Answer 1

Use the find() method to get the first div, then iterate over its children.

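from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup("""<html>
     <body>
      <div class="post">
       text <p> text </p> text <a> text </a>
       <span> text </span></div>
      <div class="post">
       another text <p> text </p></div>
     </body>
    </html>""", "html.parser")

# find() returns only the first matching <div class="post">
first_div = soup.find('div', attrs={'class': 'post'})

# Iterating over a tag yields its direct children:
# bare text nodes are stripped, child tags are serialized back to HTML
first_div_text = [child.strip() if isinstance(child, str) else str(child) for child in first_div]
print(''.join(first_div_text))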

Output

text<p> text </p>text<a> text </a><span> text </span>
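
Note that stripping each text node also removes the spaces between the text and the neighbouring tags. If that spacing matters, one alternative is to let BeautifulSoup serialize the children itself. This is only a minimal sketch, assuming bs4's Tag.decode_contents() and the built-in "html.parser"; the names html_doc and inner_html are illustrative and do not come from the question.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the question, with both divs closed
html_doc = """<html>
<body>
<div class="post">
text <p> text </p> text <a> text </a>
<span> text </span></div>
<div class="post">
another text <p> text </p></div>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first <div class="post">
first_div = soup.find("div", attrs={"class": "post"})

# decode_contents() renders everything inside the tag (text and child tags)
# as one string, without the enclosing <div> itself
inner_html = first_div.decode_contents().strip()
print(inner_html)
```

Here only the leading and trailing whitespace of the whole fragment is stripped, so the spacing between text and tags inside the div is kept as it was in the source.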
