Question

I am trying to parse the following representative HTML extract using BeautifulSoup and lxml:

[<p class="fullDetails">
<strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road

            <br/>DOUGLAS

            <br/>ISLE OF MAN
            <br/>IM1 1SA
            <br/>
<br/>Tel: 01624 689600
            <br/>Fax: 01624 689601
        <br/>
<br/>
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail:  </span>
<a href="mailto:email@abacusion.com" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">email@abacusion.com</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a>
<br/>
<br/><b>Partners(s) - ICAS members only:</b> S H Fleming, M J MacBain
        </p>]

What I want to do:

Extract the 'strong' text into company_name
Extract the text following each 'br' tag into company_line_x
Extract the 'MainContent_Email' text into company_email
Extract the 'MainContent_Web' text into company_web

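The problems I'm having:

1) I can extract all the text using .findAll(text=True), but every line comes back with a lot of padding

2) Non-ASCII characters are sometimes returned, which causes csv.writer to fail. I'm not 100% sure how to handle this correctly. (Previously I just used unicodecsv.writer.)

Any advice would be much appreciated!

At the moment, my function just receives the page data and isolates the 'p' class: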

def get_company_data(page_data):
    if not page_data:
        pass
    else:
        company_dets=page_data.findAll("p",{"class":"fullDetails"})
        print company_dets
        return company_dets

Answer 1

Here is a complete solution:

from bs4 import BeautifulSoup, NavigableString, Tag

data = """
your html here
"""

soup = BeautifulSoup(data)
p = soup.find('p', class_='fullDetails')

company_name = p.strong.text
company_lines = []
for element in p.strong.next_siblings:
    if isinstance(element, NavigableString):
        text = element.strip()
        if text:
            company_lines.append(text)

company_email = p.find('span', text=lambda x: x.startswith('E-mail:')).find_next_sibling('a').text
company_web = p.find('span', text=lambda x: x.startswith('Web:')).find_next_sibling('a').text

print company_name
print company_lines
print company_email, company_web

Prints:

Abacus Trust Company Limited
[u'Sixty Circular Road', u'DOUGLAS', u'ISLE OF MAN', u'IM1 1SA', u'Tel: 01624 689600', u'Fax: 01624 689601', u'S H Fleming, M J MacBain']
email@abacusion.com http://www.abacusiom.com

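Note that to get the company lines, we have to iterate over the next siblings of the strong tag and collect all the text nodes. company_email and company_web are retrieved by their labels, in other words, by the text of the span tags that precede them.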
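
The code above does not touch the second problem in the question (csv.writer failing on non-ASCII text). A minimal sketch of one way to handle it, assuming Python 3, where the csv module accepts Unicode strings natively and the encoding is chosen when the file is opened. On Python 2 the built-in csv module does not support Unicode input, so sticking with unicodecsv there is reasonable. The helper name write_company_row, the '; ' separator and the sample company name are made up for illustration:

```python
import csv
import io


def write_company_row(stream, name, lines, email, web):
    """Write one company as a CSV row: name, joined address lines, email, web."""
    writer = csv.writer(stream)
    writer.writerow([name, '; '.join(lines), email, web])


# In-memory demo. For a real file, open it in text mode with an explicit
# encoding, e.g. open('companies.csv', 'w', newline='', encoding='utf-8')
# (the file name is hypothetical).
buffer = io.StringIO(newline='')
write_company_row(
    buffer,
    'Caf\u00e9 Trust Company Limited',  # made-up non-ASCII name
    ['Sixty Circular Road', 'DOUGLAS'],
    'email@abacusion.com',
    'http://www.abacusiom.com',
)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing newline='' when opening the file lets the csv module control line endings itself, which avoids blank rows on Windows.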

Answer 2

Use findall() on the same p data.

(I used lxml for the sample code below.)

Get the company name:

company_name  = ''
for strg in root.findall('strong'):
    company_name = strg.text     # this will give you Abacus Trust Company Limited

Get the company lines/details:

company_line_x = ''
lines = []
for b in root.findall('br'):
    if b.tail:
        addr_line = b.tail.strip()
        if addr_line:  # skip tails that are only whitespace
            lines.append(addr_line)

company_line_x = ', '.join(lines) # this will give you Sixty Circular Road, DOUGLAS, ISLE OF MAN, IM1 1SA, Tel: 01624 689600, Fax: 01624 689601
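
The snippets above never show where root comes from, and they stop before the email and web fields. Here is a self-contained sketch that fills both gaps, written for Python 3. It makes one big assumption: it uses the standard library's xml.etree.ElementTree instead of lxml, which works only because this particular fragment happens to be well-formed XML. ElementTree exposes the same findall(), .text and .tail used above; for real-world HTML you would most likely build root with lxml instead. The helper names link_text and parse_company are mine, and the fragment is a trimmed copy of the one in the question:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's fragment; it happens to be well-formed XML.
FRAGMENT = """<p class="fullDetails">
<strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road
            <br/>DOUGLAS
            <br/>IM1 1SA
            <br/>
<br/>Tel: 01624 689600
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail:  </span>
<a href="mailto:email@abacusion.com" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">email@abacusion.com</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a>
<br/>
</p>"""


def link_text(root, id_suffix):
    """Return the stripped text of the first <a> whose id ends with id_suffix, or None."""
    for a in root.findall('a'):
        if a.get('id', '').endswith(id_suffix):
            return (a.text or '').strip()
    return None


def parse_company(fragment):
    root = ET.fromstring(fragment)  # root is the <p> element
    name = root.findtext('strong', default='').strip()
    # The text that follows each <br/> is stored in that element's .tail;
    # whitespace-only tails are dropped.
    lines = [b.tail.strip() for b in root.findall('br') if b.tail and b.tail.strip()]
    return {
        'company_name': name,
        'company_lines': lines,
        'company_email': link_text(root, 'linkToEmail'),
        'company_web': link_text(root, 'linkToSite'),
    }


company = parse_company(FRAGMENT)
print(company)
```

Matching on the id suffix rather than the full ctl00_... id is a judgment call: the generated prefix may differ from page to page, while the trailing part looks stable in the sample.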

Python - Parsing HTML class
