In BeautifulSoup

Time: 2016-10-25 20:22:41

Tags: python web-scraping beautifulsoup

Given the following HTML snippet:

<div class="mapCopy">
    <b>
        <a href="someurl.com">
          URL Text
        </a>
    </b>
    <br/>
       Address Line 1
    <br/>
       Address Line 2
    <br/>
       City, State, Zip
    <p>
        Phone: (123) 456-7890
    <br/>
        Fax: (123) 456-7890
    </p>
</div>

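How can I extract only Address Line 1, Address Line 2, and the City, State, Zip line? I believe I should be able to iterate over the div and exclude any elements with a <b> tag, but I'm not sure of the necessary syntax.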

1 answer:

Answer 0 (score: 0)

You can extract all direct children of the <div> that are plain text nodes rather than tags:

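>>> from bs4 import BeautifulSoup, NavigableString
>>> S = BeautifulSoup("<div...", "html.parser")
>>> [child.strip() for child in S.find('div').children
...      if isinstance(child, NavigableString)  # keep text nodes, skip tags
...      and child.strip()                      # skip whitespace-only nodes
... ]
['Address Line 1', 'Address Line 2', 'City, State, Zip']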
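
For reference, here is a self-contained sketch of the same idea that can be run as a script. It assumes the HTML from the question, the `bs4` package, and Python's built-in `html.parser`. The helper name `direct_text_lines` is hypothetical and only used for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# The snippet from the question, reproduced verbatim.
HTML = """
<div class="mapCopy">
    <b>
        <a href="someurl.com">
          URL Text
        </a>
    </b>
    <br/>
       Address Line 1
    <br/>
       Address Line 2
    <br/>
       City, State, Zip
    <p>
        Phone: (123) 456-7890
    <br/>
        Fax: (123) 456-7890
    </p>
</div>
"""


def direct_text_lines(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag."""
    return [
        child.strip()
        for child in tag.children
        # Tags such as <b>, <br/> and <p> are not NavigableString, so they are skipped.
        if isinstance(child, NavigableString) and child.strip()
    ]


soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", class_="mapCopy")
address_lines = direct_text_lines(div)
print(address_lines)
```

Because only direct children of the <div> are inspected, the phone and fax lines nested inside <p> and the link text nested inside <b> are left out without having to name those tags explicitly.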