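Getting content attribute-value pairs using BeautifulSoup or XPath

Time: 2014-04-22 09:02:17

Tags: html xpath web-scraping beautifulsoup lxml

For the XHTML snippet below, I need to extract attribute-value pairs from the structured HTML using BS4 or XPath. The attribute name is in an h5 tag, and its value follows in either a span tag or a p tag.

For the code below I should get the following output as a dictionary: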

Husbandary Management: 'Animal: Cow Farmer: Mr smith,'

Milch Category: 'Milk supply'

Services: 'milk, ghee'

animal colors: 'red, green ......'

<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>

htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it, up to the next 'h5'.

2 Answers:

Answer 0: (score: 2)

A sample solution using lxml.html and XPath:

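  1. select all h5 elements
  2. and for each h5 element,
    1. select the following sibling elements - following-sibling::*
    2. that are not h5 themselves - [not(self::h5)]
    3. and that have exactly the given number of preceding h5 siblings - [count(preceding-sibling::h5) = 1], then 2, then 3 ...
  3. (a for loop with enumerate() starting at 1)

    Sample code, with simple printing of the elements' text content (using lxml.html's .text_content() on the elements):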

    import lxml.html
    html = """<div id="animalcontainer" class="container last fixed-height">
    
                    <h5>
                      Husbandary Management
                    </h5>
                    <span>
                      Animal: Cow
                    </span>
                    <span>
                      Farmer: Mr smith
                    </span>
                    <h5>
                      Milch Category
                    </h5>
                    <p>
                      Milk supply
                    </p>
                    <h5>
                      Services
                    </h5>
                    <p>
                      cow milk, ghee
                    </p>
                    <h5>
                      animal colors
                    </h5>
                    <span>
                      green,red
                    </span>
    
    
                  </div>"""
    doc = lxml.html.fromstring(html)
    headers = doc.xpath('//div/h5')
    # i is the 1-based position of each header among the h5 siblings
    for i, header in enumerate(headers, start=1):
        print("--------------------------------")
        print(header.text_content().strip())
        # keep the non-h5 siblings that have exactly i h5 elements before them
        for following in header.xpath("""following-sibling::*
                                         [not(self::h5)]
                                         [count(preceding-sibling::h5) = %d]""" % i):
            print("\t" + following.text_content().strip())
    

    Output:

    --------------------------------
    Husbandary Management
        Animal: Cow
        Farmer: Mr smith
    --------------------------------
    Milch Category
        Milk supply
    --------------------------------
    Services
        cow milk, ghee
    --------------------------------
    animal colors
        green,red
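
The question asked for a dictionary rather than printed output. The same XPath predicates can be reused to build one, as in the rough sketch below. It assumes lxml is installed and that all h5 elements are siblings inside one div. The function name `pairs_from_headers`, the shortened markup, and the choice to join multiple values with a single space are illustrative assumptions, not part of the original answer.

```python
import lxml.html

# Shortened copy of the question's markup, used only for illustration.
html = """<div id="animalcontainer">
  <h5> Husbandary Management </h5>
  <span> Animal: Cow </span>
  <span> Farmer: Mr smith </span>
  <h5> Milch Category </h5>
  <p> Milk supply </p>
  <h5> animal colors </h5>
  <span> green,red </span>
</div>"""


def pairs_from_headers(doc):
    """Map each h5 text to the joined text of the siblings before the next h5."""
    result = {}
    for i, header in enumerate(doc.xpath('//div/h5'), start=1):
        # Non-h5 siblings that have exactly i h5 elements before them.
        siblings = header.xpath(
            "following-sibling::*[not(self::h5)]"
            "[count(preceding-sibling::h5) = %d]" % i)
        # Joining with a single space is an assumption about the wanted format.
        result[header.text_content().strip()] = " ".join(
            s.text_content().strip() for s in siblings)
    return result


pairs = pairs_from_headers(lxml.html.fromstring(html))
print(pairs)
```

If each value should stay a separate item, replacing the `" ".join(...)` with a list comprehension gives a dictionary of lists instead.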
    

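Answer 1: (score: 0)

I finally did it using BS. It seems it could be done more efficiently, since the following solution regenerates the siblings every time: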

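# addinfo is the already-parsed element that contains the h5 tags
h5s = addinfo.findAll('h5')
datad = {}
for h5el in h5s:
    # the sibling list is rebuilt for every h5
    hcontents = list(h5el.nextSiblingGenerator())
    txtcontents = []
    for con in hcontents:
        try:
            if con.name == 'h5':
                break  # stop at the next header
        except AttributeError:
            # nodes without a usable .name are reported and skipped
            print("error:", con)
            continue
        txtcontents.append(con.contents)
    datad["\n".join(h5el.contents)] = txtcontents
print(datad)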
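
One way to avoid regenerating the siblings for every h5 is to walk the container's children once and keep track of the most recent header. The sketch below assumes bs4 is installed, uses the built-in `html.parser`, and assumes every h5 is a direct child of the container. The name `group_by_h5` and the shortened markup are hypothetical and for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the question's markup, used only for illustration.
html = """<div id="animalcontainer">
  <h5> Husbandary Management </h5>
  <span> Animal: Cow </span>
  <span> Farmer: Mr smith </span>
  <h5> Milch Category </h5>
  <p> Milk supply </p>
  <h5> animal colors </h5>
  <span> green,red </span>
</div>"""


def group_by_h5(container):
    """Single pass over the children: each h5 starts a new group of values."""
    grouped = {}
    current = None  # text of the most recent h5, None before the first one
    for child in container.children:
        if not isinstance(child, Tag):
            continue  # skip whitespace and other bare text nodes
        text = child.get_text(strip=True)
        if child.name == 'h5':
            current = text
            grouped[current] = []
        elif current is not None:
            grouped[current].append(text)
    return grouped


soup = BeautifulSoup(html, 'html.parser')
grouped = group_by_h5(soup.find('div', id='animalcontainer'))
print(grouped)
```

Each child is visited exactly once, so the work grows linearly with the number of children instead of re-scanning the tail of the sibling list for every header.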