Sorting elements of the same class by their branch and ancestor

Asked: 2019-07-18 14:32:23

Tags: python python-3.x beautifulsoup html-parsing

I have the following HTML (the values of all name*, name** and name*** elements are unknown in advance):

    <div class="one">nameA
        <div class="two">nameAA
            <a class="three">nameAAA</a>
            <a class="three">nameAAB</a>
        </div>
        <div class="two">nameAB
            <a class="three">nameABA</a>
            <a class="three">nameABB</a>
        </div>
    </div>
    <div class="one">nameB
        <div class="two">nameBA
            <a class="three">nameBAA</a>
            <a class="three">nameBAB</a>
        </div>
        <div class="two">nameBB
            <a class="three">nameBBA</a>
            <a class="three">nameBBB</a>
        </div>
    </div>

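and I am trying to build this dictionary:

names = {nameA: [nameAAA, nameAAB, nameABA, nameABB], nameB: [nameBAA, nameBAB, nameBBA, nameBBB]}

I'm using BeautifulSoup's select function, but I can't link the names in the descendant class "three" to the name of their ancestor in class "one". The results from my code so far are two unrelated lists:

wordOnesText = [nameA, nameB]
wordThreesText = [nameAAA, nameAAB, nameABA, nameABB, nameBAA, nameBAB, nameBBA, nameBBB]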

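import bs4
import requests

res = requests.get('address')  # 'address' stands in for the real URL
soup = bs4.BeautifulSoup(res.text, features='html.parser')
wordOnes = soup.select('.one')
# Either selector returns every "three" element, with no link to its "one" ancestor
wordThrees = soup.select('.three') or soup.select('.one > .two > .three')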

Can you help me link these two lists in a dictionary?

2 Answers:

Answer 0: (score: 1)

You can try this script. It uses itertools.groupby to group the elements into key/value pairs:

data = '''<a class="one">nameA</a>
    <a class="two">nameAA</a>
        <a class="three">nameAAA</a>
        <a class="three">nameAAB</a>
    <a class="two">nameAB</a>
        <a class="three">nameABA</a>
        <a class="three">nameABB</a>
<a class="one">nameB</a>
    <a class="two">nameBA</a>
        <a class="three">nameBAA</a>
        <a class="three">nameBAB</a>
    <a class="two">nameBB</a>
        <a class="three">nameBBA</a>
        <a class="three">nameBBB</a>'''

from bs4 import BeautifulSoup
from itertools import groupby

soup = BeautifulSoup(data, 'html.parser')

def get_key_values(soup):
    current_key = None
    for v, g in groupby(soup.select('.one, .three'), lambda k: 'one' in k['class']):
        if v is True:
            current_key = next(g).text
        else:
            yield current_key, [i.text for i in g]

out = dict(get_key_values(soup))

from pprint import pprint
pprint(out)

Prints:

{'nameA': ['nameAAA', 'nameAAB', 'nameABA', 'nameABB'],
 'nameB': ['nameBAA', 'nameBAB', 'nameBBA', 'nameBBB']}
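
To make the grouping step easier to follow, here is a minimal sketch of the same idea without any HTML. The `elements` list and the `pair_keys_with_values` helper are hypothetical stand-ins (not part of the answer above) for the tags that `soup.select('.one, .three')` returns in document order:

```python
from itertools import groupby

# Hypothetical stand-in for soup.select('.one, .three'):
# (class, text) pairs listed in document order.
elements = [
    ('one', 'nameA'), ('three', 'nameAAA'), ('three', 'nameAAB'),
    ('one', 'nameB'), ('three', 'nameBAA'), ('three', 'nameBAB'),
]

def pair_keys_with_values(items):
    """Yield (key, values) from alternating runs of 'one' and 'three' items."""
    current_key = None
    # groupby splits the sequence into consecutive runs that share the same
    # result of the key function (True for 'one', False for 'three').
    for is_key, run in groupby(items, key=lambda item: item[0] == 'one'):
        if is_key:
            # Remember the text of the first 'one' item in this run.
            current_key = next(run)[1]
        else:
            # Attach the whole run of 'three' texts to the remembered key.
            yield current_key, [text for _, text in run]

result = dict(pair_keys_with_values(elements))
print(result)
```

Because groupby only looks at consecutive items, two "one" elements in a row would fall into a single run, so this approach assumes every "one" has at least one "three" descendant.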

Answer 1: (score: 1)

Try the following code.

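from bs4 import BeautifulSoup

# data is assumed to hold the nested HTML from the question; 'lxml' requires the lxml package
itemdict = {}
soup = BeautifulSoup(data, 'lxml')
for item in soup.select('.one'):
    # The first child of each "one" element is its own text, e.g. "nameA"
    name = item.contents[0].strip()
    # Calling select() on the element returns only its own "three" descendants
    itemdict[name] = [child.text for child in item.select('.three')]

print(itemdict)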

This should print:

{'nameA': ['nameAAA', 'nameAAB', 'nameABA', 'nameABB'], 'nameB': ['nameBAA', 'nameBAB', 'nameBBA', 'nameBBB']}
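
The link can also be followed in the other direction, starting from each "three" element and walking up to its "one" ancestor with find_parent. The following is only a sketch under assumptions: the `html` string is a shortened, well-formed version of the question's markup, and `names` is an illustrative variable name.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened, well-formed version of the question's markup (an assumption).
html = '''<div class="one">nameA
    <div class="two">nameAA
        <a class="three">nameAAA</a>
        <a class="three">nameAAB</a>
    </div>
</div>
<div class="one">nameB
    <div class="two">nameBA
        <a class="three">nameBAA</a>
    </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

names = defaultdict(list)
for three in soup.select('.three'):
    # Walk up the tree to the closest ancestor carrying class "one".
    one = three.find_parent(class_='one')
    if one is None:
        continue  # skip "three" elements that sit outside any "one" block
    # The ancestor's own text is its first child, before the nested divs.
    key = one.contents[0].strip()
    names[key].append(three.get_text())

print(dict(names))
```

Unlike the loop above, this variant only creates keys for "one" elements that have at least one "three" descendant.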