Efficient way to loop over tags with Beautiful Soup

Time: 2016-06-09 13:40:32

Tags: python for-loop beautifulsoup

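I want to extract information from multiple XML tags that share a similar structure. I loop over the children and append their values to a dictionary. Is there a way to avoid a separate for loop for each tag (such as sn and count in my MWE)?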

from bs4 import BeautifulSoup as bs
import pandas as pd

xml = """
    <info>
    <tag>
         <sn>9-542</sn>
         <count>14</count>
    </tag>
    <tag>
         <sn>3-425</sn>
         <count>16</count>
    </tag>
    </info>
    """

bs_obj = bs(xml, "lxml")
info = bs_obj.find_all('tag')


d = {}

# I want to avoid these multiple for-loops
d['sn'] = [i.sn.text for i in info]
d['count'] = [i.count.text for i in info]

pd.DataFrame(d)

1 answer:

Answer 0 (score: 1)

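Consider the following approach. It uses two for loops only to keep the solution dynamic: if you want another tag, the only thing to change is the needed_tags list.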

from collections import defaultdict

d = defaultdict(list)

needed_tags = ['sn', 'count']
for i in info:
    for tag in needed_tags:
        d[tag].append(getattr(i, tag).text)

print(d)
>> defaultdict(<class 'list'>, {'count': ['14', '16'], 'sn': ['9-542', '3-425']})
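
As a self-contained sketch of the same idea, the loop can be wrapped in a small helper. The name collect_columns is hypothetical, and the sketch makes two assumptions that are not in the original answer: it parses with the built-in "html.parser" instead of "lxml" so that it runs without extra dependencies, and it looks children up with find() rather than attribute access, which avoids any clash between a tag name and an attribute of the Tag object itself.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

xml = """
<info>
  <tag><sn>9-542</sn><count>14</count></tag>
  <tag><sn>3-425</sn><count>16</count></tag>
</info>
"""


def collect_columns(markup, row_tag, needed_tags):
    """Hypothetical helper: build one list of text values per requested child tag."""
    # Assumption: "html.parser" is used here only to avoid the lxml dependency.
    soup = BeautifulSoup(markup, "html.parser")
    columns = defaultdict(list)
    for row in soup.find_all(row_tag):
        for name in needed_tags:
            child = row.find(name)
            # Store None when a row lacks the tag, so all lists stay the same length.
            columns[name].append(child.get_text() if child is not None else None)
    return dict(columns)


d = collect_columns(xml, "tag", ["sn", "count"])
print(d)
```

Storing None for a missing child is a design choice of this sketch; it keeps every list aligned so the result can still be passed to pd.DataFrame(d).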

For your exact example, this can be simplified to:

from collections import defaultdict

d = defaultdict(list)

for i in info:
    d['sn'].append(i.sn.text)
    d['count'].append(i.count.text)

print(d)
>> defaultdict(<class 'list'>, {'count': ['14', '16'], 'sn': ['9-542', '3-425']})
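
If the tag names do not need to be listed at all, one alternative (not part of the answer above) is to build one dictionary per <tag> element from whatever direct children it has. This is a minimal sketch and assumes that every child of <tag> is a simple element holding text. It again uses "html.parser" as an assumption to stay dependency-free.

```python
from bs4 import BeautifulSoup

xml = """
<info>
  <tag><sn>9-542</sn><count>14</count></tag>
  <tag><sn>3-425</sn><count>16</count></tag>
</info>
"""

soup = BeautifulSoup(xml, "html.parser")

# One dict per <tag>; find_all(True, recursive=False) yields only direct child
# elements, so whitespace text nodes between them are skipped.
rows = [
    {child.name: child.get_text() for child in tag.find_all(True, recursive=False)}
    for tag in soup.find_all("tag")
]

print(rows)
```

A list of dictionaries like this can be passed straight to pd.DataFrame(rows), which should give one row per <tag> and one column per child name.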