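Extract the information in i and br tags and save it in a dictionary

Date: 2017-05-23 10:48:03

Tags: python dictionary beautifulsoup

I have an HTML page, and I need to extract the information in the i and br tags and save it in a dictionary. The HTML looks like this: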

<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>

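I can extract the text inside the div tag with class rbody. I can also extract the content between the i tags, but not the information that precedes each br tag. Can anyone suggest a way to extract this information and save it as key-value pairs in a dictionary? For example: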

dictionary = {'objectid': 137000, 'topoid': 504514394, 'poigroup': 'Hydrography', 'poitype':'Manmade Waterbody', 'poiname' : 'FOUR CORNERS DAM', 'X':1.5778346701624997E7, 'y':-3861557.6243750006}

3 answers:

Answer 0 (score: 0)

Why not use a regular expression? You don't need to parse the actual HTML (unless you also need positional information):

import re

data = """
<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>
"""

parsed = dict(element for element in re.findall(r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>", data))
print(parsed)
# {'poigroup': 'Hydrography', 'objectid': '137000', 'topoid': '504514394', 'poilabeltype': 'NAMED', 'X': '1.5778346701624997E7', 'Point': '', 'poialtlabel': '', 'poitype': 'Manmade Waterbody', 'poiname': 'FOUR CORNERS DAM', 'poilabel': 'FOUR CORNERS DAM', 'Y': '-3861557.6243750006'}

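If you want X and Y as floats, and so on, you need an extra post-processing step. For a generic solution, you can try to convert every value to the most specific type that fits: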

def conv(pair):
    if len(pair) < 2 or not pair[1]:
        return pair[0], None
    try:
        return pair[0], int(pair[1])
    except ValueError:
        try:
            return pair[0], float(pair[1])
        except ValueError:
            return pair

parsed = dict(conv(element) for element in re.findall(r"<i>\s*(.*?):.*?</i>\s*(.*?)\s*<br/>", data))
print(parsed)
# {'X': 15778346.701624997, 'Y': -3861557.6243750006, 'objectid': 137000, 'poilabeltype': 'NAMED', 'poialtlabel': None, 'poiname': 'FOUR CORNERS DAM', 'poitype': 'Manmade Waterbody', 'Point': None, 'poilabel': 'FOUR CORNERS DAM', 'topoid': 504514394, 'poigroup': 'Hydrography'}

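How it works: the pattern looks for two capture groups in each <i>...<br/> entry. The first group starts right after the opening <i> (leading whitespace allowed) and runs up to the colon; the second starts after the closing </i> (whitespace allowed again) and runs up to <br/>. re.findall returns every such match as a (key, value) tuple, and dict() builds the dictionary from those tuples, using the first group as the key and the second as the value.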
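To make the two groups easier to see, here is a sketch of the same pattern written with re.VERBOSE so that each piece carries a comment. The name PAIR and the shortened sample string are illustrative assumptions, not part of the answer; the pattern itself is unchanged.

```python
import re

# Same pattern as in the answer, split across lines with re.VERBOSE.
# PAIR is a hypothetical name chosen for this sketch.
PAIR = re.compile(
    r"""
    <i>\s*      # opening <i>, optional whitespace
    (.*?)       # group 1: the key, as short as possible
    :.*?</i>    # the colon, then anything up to the closing </i>
    \s*         # whitespace before the value
    (.*?)       # group 2: the value, possibly empty
    \s*<br/>    # trailing whitespace, then the <br/> that ends the entry
    """,
    re.VERBOSE,
)

# A shortened sample taken from the question's HTML.
sample = (
    "<i>objectid: </i> 137000<br/>\n"
    "<i>poialtlabel: </i> <br/>\n"
    "<i>Point:</i><br/>\n"
    "<i>X: </i> 1.5778346701624997E7 <br/>\n"
)

pairs = PAIR.findall(sample)   # list of (key, value) tuples
parsed = dict(pairs)
print(parsed)
```

Entries with nothing between </i> and <br/>, such as Point, still match; their second group is simply the empty string.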

Answer 1 (score: 0)

Take a look at the following approach:

from bs4 import BeautifulSoup as Soup, NavigableString

html = """<div class="rbody">
<div style="color:#ff6666"> </div>
<i>objectid: </i> 137000<br/>
<i>topoid: </i> 504514394<br/>
<i>poigroup: </i> Hydrography<br/>
<i>poitype: </i> Manmade Waterbody<br/>
<i>poiname: </i> FOUR CORNERS DAM<br/>
<i>poilabel: </i> FOUR CORNERS DAM<br/>
<i>poilabeltype: </i> NAMED<br/>
<i>poialtlabel: </i> <br/>
<i>Point:</i><br/>
<i>X: </i> 1.5778346701624997E7 <br/>
<i>Y: </i> -3861557.6243750006 <br/>
<br/><br/>
</div>"""

soup = Soup(html, 'html.parser')

obj = dict()
for i in soup.find_all('i'):
    key = str(i.get_text()).strip(' :')
    value = i.next_sibling
    if isinstance(value, NavigableString):  # Point has no value: its next sibling is the <br/> tag, not a string.
        obj[key] = str(value).strip()
print(obj)

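Note that Point has no value, so you need to check whether the next sibling is a string.

For details, see the BeautifulSoup documentation on .next_sibling and .previous_sibling and on navigating tags and NavigableString objects.

The find_all loop prints the following, using nothing but BeautifulSoup: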

{
  'poilabeltype': 'NAMED',
  'objectid': '137000',
  'poilabel': 'FOUR CORNERS DAM',
  'poialtlabel': '',
  'poigroup': 'Hydrography',
  'Y': '-3861557.6243750006',
  'X': '1.5778346701624997E7',
  'poiname': 'FOUR CORNERS DAM',
  'poitype': 'Manmade Waterbody',
  'topoid': '504514394'
}
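To see why the isinstance check matters, this small sketch inspects the type of next_sibling for an entry that has a value and for one that does not. The two-entry snippet and the names first and second are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# A two-entry snippet cut down from the question's HTML.
snippet = '<div class="rbody"><i>objectid: </i> 137000<br/><i>Point:</i><br/></div>'
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("i")

# Text follows the first <i>, so its next sibling is a NavigableString.
print(type(first.next_sibling).__name__, repr(str(first.next_sibling)))

# <i>Point:</i> is followed directly by <br/>, so its next sibling is a Tag.
print(type(second.next_sibling).__name__, second.next_sibling.name)
```

Because the sibling after Point is a Tag rather than a string, the loop skips it, which is why Point does not appear in the printed dictionary.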

Answer 2 (score: 0)

You can first decompose the "br" tags, then retrieve the i tags with the select method and use next_sibling to get the text that follows each tag.

In [81]: from bs4 import BeautifulSoup as BS

In [82]: html = """<div class="rbody">
    ...: <div style="color:#ff6666"> </div>
    ...: <i>objectid: </i> 137000<br/>
    ...: <i>topoid: </i> 504514394<br/>
    ...: <i>poigroup: </i> Hydrography<br/>
    ...: <i>poitype: </i> Manmade Waterbody<br/>
    ...: <i>poiname: </i> FOUR CORNERS DAM<br/>
    ...: <i>poilabel: </i> FOUR CORNERS DAM<br/>
    ...: <i>poilabeltype: </i> NAMED<br/>
    ...: <i>poialtlabel: </i> <br/>
    ...: <i>Point:</i><br/>
    ...: <i>X: </i> 1.5778346701624997E7 <br/>
    ...: <i>Y: </i> -3861557.6243750006 <br/>
    ...: <br/><br/>
    ...: </div>"""

In [83]: soup = BS(html, "html.parser")

In [84]: for br in soup.select(".rbody > br"):
    ...:     br.decompose()
    ...:     

In [85]: {i.get_text(strip=True).replace(":", ""): i.next_sibling.strip() for i in soup.select(".rbody > i")}
Out[85]: 
{'Point': '',
 'X': '1.5778346701624997E7',
 'Y': '-3861557.6243750006',
 'objectid': '137000',
 'poialtlabel': '',
 'poigroup': 'Hydrography',
 'poilabel': 'FOUR CORNERS DAM',
 'poilabeltype': 'NAMED',
 'poiname': 'FOUR CORNERS DAM',
 'poitype': 'Manmade Waterbody',
 'topoid': '504514394'}
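The result above still holds strings, while the question asks for numeric values. Below is a minimal post-processing sketch that uses only the standard library. The helper names to_number and clean are hypothetical. Dropping empty entries is an assumption: the question's example suggests it but does not state it.

```python
def to_number(text):
    """Return an int or float when text parses as one, else the text unchanged."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def clean(raw):
    # Assumption: entries with an empty value are dropped; the rest are
    # converted to numbers where possible.
    return {key: to_number(value) for key, value in raw.items() if value}


# A subset of the dictionary shown in Out[85].
raw = {
    'Point': '',
    'X': '1.5778346701624997E7',
    'Y': '-3861557.6243750006',
    'objectid': '137000',
    'poialtlabel': '',
    'poigroup': 'Hydrography',
}
result = clean(raw)
print(result)
```

One caveat: float() also accepts strings such as "nan" and "inf" in any letter case, so a text value spelled that way would be converted too. If that can occur in your data, restrict the conversion to the keys you know are numeric.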