Question

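OK, I've been trying to parse an HTML tag that contains other tags as well as text.

For example, if I have this HTML (yes, I know using <b> and <i> is bad, but this is just a simple example):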

<p> <b> 1 </b> Apple <b> 2 </b> <i> Orange </i> <b> 3 </b> Pineapple </p>

It would render something like this:

1 Apple 2 Orange 3 Pineapple

How can I get the relationship

{"1": "Apple", "2": "<i> Orange </i>", "3": "Pineapple"}

I tried beautifulsoup's tag.next, but it doesn't return the tag; it just stops.

I tried beautifulsoup's tag.find(text = True, recursive = False), which returns nothing except \n.

I tried

b = tags.findAll("b")

for i in b:
    print i.text
    print tags.find(i).text

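I've searched for parsing tags within tags and found nothing that really fits. Some answers suggested regular expressions (sounds like trouble), and some said it can't be done (not very helpful).

I think what I need to figure out is how to get the HTML between two tags. I tried iterating over .nextSibling, but it eventually gave me a unicode space, so I couldn't keep iterating.

Does anyone have experience with this?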

Answer 1

Accumulate the elements (tags and text) that come before and after each <b> tag inside <p>:

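#!/usr/bin/env python
# Python 2 / BeautifulSoup 3, matching the output shown below
from collections import defaultdict
import pprint

from BeautifulSoup import BeautifulSoup

# the HTML from the question
html = "<p> <b> 1 </b> Apple <b> 2 </b> <i> Orange </i> <b> 3 </b> Pineapple </p>"

d = defaultdict(list)  # maps the running <b> count to the elements that follow it
soup = BeautifulSoup(html)
i = 0  # 0 collects whatever precedes the first <b>
for el in soup.p.contents:
    if getattr(el, 'name', None) == 'b':
        i += 1  # switch to next <b> element
    else:
        d[i].append(el)  # text node or non-<b> tag

pprint.pprint(dict(d))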

It expresses the intent correctly, but it is not very readable or efficient.

Output

{0: [u' '],
 1: [u' Apple '],
 2: [u' ', <i> Orange </i>, u' '],
 3: [u' Pineapple ']}
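
The result above is keyed by a counter and still holds raw nodes and whitespace. As a rough sketch only, here is one way the same accumulation could be adapted to produce the dictionary the question asks for. It assumes Python 3 and the newer bs4 package (not the BeautifulSoup 3 import used above) with the built-in "html.parser" backend; the function name pair_labels_with_trailing is made up for illustration.

```python
from bs4 import BeautifulSoup, Tag

html = "<p> <b> 1 </b> Apple <b> 2 </b> <i> Orange </i> <b> 3 </b> Pineapple </p>"


def pair_labels_with_trailing(html_text):
    """Map each <b> tag's text to the markup that follows it (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    collected = {}
    key = None  # text of the most recent <b>; None until the first one is seen
    for el in soup.p.contents:
        if isinstance(el, Tag) and el.name == "b":
            key = el.get_text(strip=True)  # " 1 " becomes "1"
            collected[key] = []
        elif key is not None:
            # str() gives the text of a string node or the markup of a tag
            collected[key].append(str(el))
    # Join the pieces for each key and trim the surrounding whitespace.
    return {k: "".join(parts).strip() for k, parts in collected.items()}


mapping = pair_labels_with_trailing(html)
print(mapping)
# {'1': 'Apple', '2': '<i> Orange </i>', '3': 'Pineapple'}
```

Anything that appears before the first <b> is dropped here, which corresponds to key 0 in the output above. If two <b> tags carried the same text, the later one would overwrite the earlier one, so this only fits input where the labels are unique.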
