Question

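For the following XHTML snippet, I need to extract attribute-value pairs from the structured HTML using BS4 or XPath. Each attribute name is in an h5 tag, and its value appears in the span or p tags that follow it.

For the code below, I should get the following output as a dictionary:

Husbandary Management: 'Animal: Cow Farmer: Mr smith,'

Milch Category: 'Milk supply'

Services: 'cow milk, ghee'

animal colors: 'red, green ...'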

<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>

htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it, up to the next 'h5'.

Answer 1

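A sample solution using lxml.html and XPath:

Select all h5 elements,
and for each h5 element,
1. select the following sibling elements - following-sibling::*
2. that are not h5 themselves - [not(self::h5)]
3. and that have exactly as many preceding h5 siblings as the position of the current h5 - [count(preceding-sibling::h5) = 1], then 2, then 3 ...

(hence the for loop with enumerate() starting at 1)

Sample code, with a simple print of each element's text content (using lxml.html's .text_content() on the elements):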

import lxml.html
html = """<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>"""
doc = lxml.html.fromstring(html)
headers = doc.xpath('//div/h5')
for i, header in enumerate(headers, start=1):
    print("--------------------------------")
    print(header.text_content().strip())
    # siblings after this h5 that are not h5 and have exactly i h5 elements before them
    for following in header.xpath("""following-sibling::*
                                     [not(self::h5)]
                                     [count(preceding-sibling::h5) = %d]""" % i):
        print("\t" + following.text_content().strip())

Output:

--------------------------------
Husbandary Management
    Animal: Cow
    Farmer: Mr smith
--------------------------------
Milch Category
    Milk supply
--------------------------------
Services
    cow milk, ghee
--------------------------------
animal colors
    green,red
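
The loop above only prints each group. To collect the dictionary the question asks for, the same XPath can fill a dict instead. This is a minimal sketch, not part of the original answer: the function name h5_sections and the shortened HTML are made up for illustration, and it passes the counter as an XPath variable ($n) instead of using % formatting.

```python
import lxml.html

# Shortened copy of the question's markup, for illustration only.
HTML = """<div id="animalcontainer" class="container last fixed-height">
  <h5>Husbandary Management</h5>
  <span>Animal: Cow</span>
  <span>Farmer: Mr smith</span>
  <h5>Milch Category</h5>
  <p>Milk supply</p>
  <h5>Services</h5>
  <p>cow milk, ghee</p>
  <h5>animal colors</h5>
  <span>green,red</span>
</div>"""


def h5_sections(html):
    """Map each h5 heading to the texts of the siblings that follow it."""
    doc = lxml.html.fromstring(html)
    result = {}
    for i, header in enumerate(doc.xpath('//div/h5'), start=1):
        # Non-h5 siblings after this header with exactly i h5 elements before them.
        siblings = header.xpath(
            "following-sibling::*[not(self::h5)]"
            "[count(preceding-sibling::h5) = $n]",
            n=i,
        )
        result[header.text_content().strip()] = [
            s.text_content().strip() for s in siblings
        ]
    return result


sections = h5_sections(HTML)
for name, values in sections.items():
    print(name, values)
```

Each value is a list, so a heading followed by two span elements keeps both texts separate; join them with " ".join(values) if a single string per heading is wanted. Like the original, this relies on all h5 elements being direct children of the same div.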

Answer 2

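I finally got it working with BS. It could probably be done more efficiently, since the following solution regenerates the siblings for every h5: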

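# addinfo is assumed to be the parsed BeautifulSoup object (or tag) containing the div
h5s = addinfo.findAll('h5')
datad = {}
for h5el in h5s:
    # nextSiblingGenerator() is the BS3-era name; BS4 calls it next_siblings
    hcontents = list(h5el.nextSiblingGenerator())
    txtcontents = []
    for con in hcontents:
        try:
            # the next h5 starts the next attribute, so stop here
            if con.name == 'h5':
                break
            txtcontents.append(con.contents)
        except AttributeError:
            # text nodes (e.g. whitespace between tags) are not tags; skip them
            print("error:", con)
            continue
    datad["\n".join(h5el.contents)] = txtcontents
print(datad)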
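
On the efficiency point, one pass over the container's children is enough: start a new group at each h5 and append everything else to the current group. The sketch below is only an illustration of that idea. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree, which assumes the snippet is well-formed XHTML without a namespace; the names XHTML and group_by_h5 are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup, for illustration only.
XHTML = """<div id="animalcontainer" class="container last fixed-height">
  <h5>Husbandary Management</h5>
  <span>Animal: Cow</span>
  <span>Farmer: Mr smith</span>
  <h5>Milch Category</h5>
  <p>Milk supply</p>
  <h5>Services</h5>
  <p>cow milk, ghee</p>
  <h5>animal colors</h5>
  <span>green,red</span>
</div>"""


def group_by_h5(container):
    """Walk the container's child elements once, grouping them under each h5."""
    groups = {}
    current = None  # value list of the most recent h5, or None before the first one
    for child in container:  # iterating an Element yields child elements only
        text = "".join(child.itertext()).strip()
        if child.tag == "h5":
            current = groups.setdefault(text, [])
        elif current is not None:
            current.append(text)
    return groups


data = group_by_h5(ET.fromstring(XHTML))
print(data)
```

Because every child is visited exactly once, the work grows linearly with the number of children instead of rescanning the siblings for each h5. For real-world HTML that is not well-formed, the same loop can run over the children of a BeautifulSoup or lxml element instead.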

Get content attribute-value pairs using BeautifulSoup or XPath
