Question

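I'm trying to parse HTML with BeautifulSoup (using the lxml parser). On nested tags I get the same text repeated.

I tried iterating and only counting tags that have no children, but then I lose data.

Given: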

<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments</span>
      </li>
   </ul>
</div>

and running:

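from bs4 import BeautifulSoup

soup = BeautifulSoup(file_info, features="lxml")
soup.prettify().encode("utf-8")  # result is discarded; has no effect on soup
for tag in soup.find_all(True):
    # check_text is my own helper: False for an empty string or all digits
    if check_text(tag.text):
        print(tag.text)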

I get "to post comments" 4 times. Is there a BeautifulSoup way to get the result only once?

Answer 1

You can simply use find() instead of find_all() to get the desired result.
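A minimal sketch of that idea, assuming the text you want lives in the first <span> of the document. It uses the built-in "html.parser" instead of lxml only to keep the example free of extra dependencies; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments</span>
      </li>
   </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag (or None if nothing matches),
# so the text is read from one tag instead of from every enclosing ancestor.
first_span = soup.find("span")
if first_span is not None:
    print(first_span.get_text(strip=True))
```

Note that find() stops at the first match, so if the document contains several such elements, only the first one is returned.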

Answer 2

Given input like

<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments1</span>
      </li>
   </ul>
</div>
<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments2</span>
      </li>
   </ul>
</div>
<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments3</span>
      </li>
   </ul>
</div>

you can do something like

[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]

which gives

[' to post comments1', ' to post comments2', ' to post comments3']

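find_all() is used to find all <li> tags whose class is comment_forbidden first last, and the content of each <li>'s <span> child is then read through its string attribute. Because only the <li> tags are visited, and not their ancestors, each piece of text appears exactly once.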
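For reference, here is a self-contained sketch of the approach described above. It assumes the built-in "html.parser" (the question uses lxml) and illustrative variable names. The last lines show a tag-independent alternative, stripped_strings, which walks the text nodes rather than the tags; that part is a suggestion on top of the answer, not something the question asked for by name.

```python
from bs4 import BeautifulSoup

html = """
<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments1</span>
      </li>
   </ul>
</div>
<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments2</span>
      </li>
   </ul>
</div>
<div class="links">
   <ul class="links inline">
      <li class="comment_forbidden first last">
         <span> to post comments3</span>
      </li>
   </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the <li> tags, matching the exact class attribute string,
# then read the single string inside each <li>'s <span> child.
items = soup.find_all("li", class_="comment_forbidden first last")
comments = [li.span.string for li in items]
print(comments)

# Alternative when the tag and class names are not known in advance:
# text nodes are not nested, so each one is yielded once
# (with surrounding whitespace stripped and blank nodes skipped).
all_texts = list(soup.stripped_strings)
print(all_texts)
```

The first form keeps the leading space from the source markup, while stripped_strings trims it.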

Getting duplicates in BeautifulSoup nested tags
