Parsing text at the same level as HTML tags in beautifulsoup4

Time: 2016-12-01 11:34:46

Tags: python python-3.x beautifulsoup

I want to parse this HTML block:

<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>

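How can I parse it with BeautifulSoup? I know how to extract the content inside a tag, but how do I extract text that sits at the same level as the tags?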

soup1.find("div", class_="class123") 

grabs everything inside the first div, including the nested tags.
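To make the problem concrete, here is a minimal sketch showing that the text of the matched div also contains the text of its nested tags. It assumes the built-in "html.parser" backend and reuses the name soup1 from the snippet above; the other variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>"""

# Assumption: the standard-library parser; lxml would also work if installed.
soup1 = BeautifulSoup(html, "html.parser")
outer = soup1.find("div", class_="class123")

# get_text() walks every descendant, so nested tag text is mixed in
# with the text that belongs directly to the outer div.
all_text = outer.get_text()
print("title123" in all_text)
print("something else" in all_text)
print("I want to parse this" in all_text)
```

All three checks print True, which is why an extra filtering step is needed to keep only the outer div's own text.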

2 answers:

Answer 0: (score: 0)

You can iterate over the div's contents like this:

>>> from bs4 import NavigableString
>>> for x in soup.find("div", class_="class123").contents:
...     if isinstance(x, NavigableString):
...         print(x.strip())
...

I want to parse this, how can do that?

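.contents is a list of the Tag and NavigableString objects that are direct children of the parent.

A NavigableString is a text node; unlike a Tag, it has no child elements.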
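As a self-contained Python 3 sketch of the same idea, the loop can be written as a list comprehension that also drops whitespace-only nodes. The "html.parser" backend and the variable names are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>"""

# Assumption: the standard-library parser is used here.
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="class123")

# Keep only direct children that are text nodes, and skip the ones
# that are empty once surrounding whitespace is stripped.
direct_text = [
    child.strip()
    for child in outer.contents
    if isinstance(child, NavigableString) and child.strip()
]
print(direct_text)
```

Note that HTML comments are also NavigableString subclasses in bs4, so a page containing comments directly inside the div may need an extra check.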

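Answer 1: (score: 0)

I think what you are asking is how to extract the text contained directly in an element, without the child elements or the text inside them.

You can use .find_all(text=True, recursive=False) (see "Only extracting text from this element, not its children").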

>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(
...     """<div class="class123">
...   <div><strong>title123</strong>
...     <span style="something123">something else</span>
...   </div>
... 
...   I want to parse this, how can do that?
... </div>""", 'lxml')
>>> 
>>> print(soup.find("div", class_="class123").find_all(text=True, recursive=False))
['\n', '\n\n  I want to parse this, how can do that?\n']

If there are multiple matching <div> elements, you will have to loop over them:

>>> for result in soup.find_all("div", class_="class123"):
...     print(result.find_all(text=True, recursive=False))
... 
['\n', '\n\n  I want to parse this, how can do that?\n']

Finally, you can tidy up the results to return a single string:

>>> print(" ".join([s.strip() for s in \
...     soup.find("div", class_="class123").find_all(text=True, recursive=False) \
...     ]).strip())
I want to parse this, how can do that?
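The steps above could be wrapped in a small helper, sketched here under a few assumptions: own_text is a hypothetical name, the "html.parser" backend is used instead of lxml so that no extra dependency is needed, and string=True is used as the newer spelling of text=True (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Hypothetical helper: join the text nodes that are direct children of tag."""
    # recursive=False restricts the search to direct children;
    # string=True matches text nodes rather than tags.
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())


html = """<div class="class123">
  <div><strong>title123</strong>
    <span style="something123">something else</span>
  </div>

  I want to parse this, how can do that?
</div>"""

soup = BeautifulSoup(html, "html.parser")

# One cleaned string per matching div.
results = [own_text(div) for div in soup.find_all("div", class_="class123")]
print(results)
```

Filtering out empty pieces before joining avoids the stray spaces that whitespace-only nodes would otherwise leave in the result.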