I want to parse this HTML block:
<div class="class123">
<div><strong>title123</strong>
<span style="something123">something else</span>
</div>
I want to parse this, how can do that?
</div>
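How can I parse it with BeautifulSoup? I know how to parse the content inside a tag, but how do I parse content that sits at the same level as a child tag?
soup1.find("div", class_="class123")
grabs everything inside the first div.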
Answer 0 (score: 0)
You can iterate over the div's contents like this:
>>> from bs4 import NavigableString
>>> for x in soup.find("div", class_="class123").contents:
...     if isinstance(x, NavigableString):
...         print(x.strip())
...
I want to parse this, how can do that?
.contents is a list of the Tag and NavigableString objects contained directly in the parent.
Here a NavigableString is a piece of text that contains no child elements.
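To make the distinction concrete, here is a minimal sketch that lists the direct children of the outer div and keeps only the non-empty strings. It assumes the markup from the question and the built-in "html.parser" backend; the names html, outer, and direct_strings are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<div class="class123">
<div><strong>title123</strong>
<span style="something123">something else</span>
</div>
I want to parse this, how can do that?
</div>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", class_="class123")

# .contents holds only the direct children: Tag and NavigableString objects.
for child in outer.contents:
    if isinstance(child, Tag):
        print("Tag", child.name)
    else:
        print("NavigableString", repr(str(child)))

# Keep the direct strings, skipping whitespace-only nodes such as "\n".
direct_strings = [
    child.strip()
    for child in outer.contents
    if isinstance(child, NavigableString) and child.strip()
]
print(direct_strings)
```

Filtering on child.strip() avoids printing blank lines for the whitespace-only strings that sit between tags.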
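Answer 1 (score: 0)
I think what you are asking is how to extract the text contained directly in an element, excluding its child elements and any text inside them.
You can use .find_all(text=True, recursive=False)
(see "Only extracting text from this element, not its children").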
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(
... """<div class="class123">
... <div><strong>title123</strong>
... <span style="something123">something else</span>
... </div>
...
... I want to parse this, how can do that?
... </div>""", 'lxml')
>>>
>>> print(soup.find("div", class_="class123").find_all(text=True, recursive=False))
['\n', '\n\n I want to parse this, how can do that?\n']
If there are multiple matching <div> elements, you will have to loop over them:
>>> for result in soup.find_all("div", class_="class123"):
...     print(result.find_all(text=True, recursive=False))
...
['\n', '\n\n I want to parse this, how can do that?\n']
Finally, you can tidy up the result to return a single string:
>>> print(" ".join([s.strip() for s in \
... soup.find("div", class_="class123").find_all(text=True, recursive=False) \
... ]).strip())
I want to parse this, how can do that?
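The steps above could be wrapped in a small helper. This is only a sketch: direct_text is a hypothetical name, not part of Beautiful Soup, and it uses string=True, which is the name of the text= argument in Beautiful Soup 4.4.0 and later. It also uses the built-in "html.parser" backend rather than lxml.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the text directly inside `tag`, ignoring text in child elements."""
    # recursive=False limits the search to direct children of `tag`.
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only strings and join the rest with single spaces.
    return " ".join(p.strip() for p in pieces if p.strip())


html = """<div class="class123">
<div><strong>title123</strong>
<span style="something123">something else</span>
</div>

I want to parse this, how can do that?
</div>"""

soup = BeautifulSoup(html, "html.parser")

# One result per matching <div>; here only the outer div has the class.
texts = [direct_text(div) for div in soup.find_all("div", class_="class123")]
print(texts)
```

Dropping empty pieces before joining means no stray spaces are left over, so the final .strip() from the one-liner is not needed.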