Question

I want to parse this HTML block:

<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>

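How can I parse it with BeautifulSoup? I know how to parse the content inside a tag, but how do I parse text that sits at the same level as the tags?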

soup1.find("div", class_="class123")

grabs everything inside the first div.

Answer 1

You can iterate over the div's contents like this:

>>> from bs4 import NavigableString
>>> for x in soup.find("div", class_="class123").contents:
...     if isinstance(x, NavigableString):
...         print(x.strip())
...

I want to parse this, how can do that?

.contents is a list of the Tag and NavigableString objects contained in the parent.

Here, a NavigableString is a string that contains no child elements.
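The same idea can be wrapped in a small helper. This is only a sketch: the name direct_text is made up for illustration, and it assumes the built-in "html.parser" backend so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>"""


def direct_text(tag):
    """Return the text that sits directly under `tag`, ignoring nested tags.

    `direct_text` is a hypothetical helper name, not part of bs4.
    """
    pieces = []
    for child in tag.contents:
        # Nested Tag objects are skipped; only strings that are direct
        # children of `tag` are kept.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:  # drop whitespace-only strings
                pieces.append(stripped)
    return " ".join(pieces)


soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("div", class_="class123")
result = direct_text(outer)
print(result)
```

Dropping the whitespace-only strings avoids the empty line that the plain loop prints for the newline before the inner div.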

Answer 2

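I think what you are asking is how to extract the text contained directly in an element, without the child elements or the text inside them.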

You can use .find_all(text=True, recursive=False) (see "Only extracting text from this element, not its children").

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(
...     """<div class="class123">
...   <div><strong>title123</strong>
...     <span style="something123">something else</span>
...   </div>
... 
...   I want to parse this, how can do that?
... </div>""", 'lxml')
>>> 
>>> print(soup.find("div", class_="class123").find_all(text=True, recursive=False))
['\n', '\n\n  I want to parse this, how can do that?\n']

If there are several matching <div> elements, you will have to loop over them:

>>> for result in soup.find_all("div", class_="class123"):
...     print(result.find_all(text=True, recursive=False))
... 
['\n', '\n\n  I want to parse this, how can do that?\n']

Finally, you can tidy up the results to return a single string:

>>> print(" ".join([s.strip() for s in \
...     soup.find("div", class_="class123").find_all(text=True, recursive=False) \
...     ]).strip())
I want to parse this, how can do that?
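Putting the loop and the clean-up together might look like the sketch below. Several things here are assumptions rather than part of the answer: the helper name own_text, the two-div sample document, the "html.parser" backend, and the string=True keyword, which recent Beautiful Soup 4 releases accept as the newer spelling of text=True.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two matching divs, for illustration only.
HTML = """
<div class="class123"><div><strong>a</strong></div> first text </div>
<div class="class123"><span>b</span> second text </div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def own_text(tag):
    """Join the non-empty text nodes that are direct children of `tag`."""
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside nested tags is left out.
    strings = tag.find_all(string=True, recursive=False)
    return " ".join(s.strip() for s in strings if s.strip())


texts = [own_text(div) for div in soup.find_all("div", class_="class123")]
print(texts)
```

On an older installation where string=True is not recognized, text=True as used above should behave the same way.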

Parsing text at the same level as HTML tags in BeautifulSoup4
