BeautifulSoup with no classes or tags

Date: 2016-04-02 21:21:53

Tags: python beautifulsoup

I am trying to scrape a rather strange website that uses many br tags and has no classes or ids. I am using Beautiful Soup:

import requests
from bs4 import BeautifulSoup

url = "http://www.fveconstruction.ch/anDetails.asp?RT=2&M=06&R=4&ID=10003401"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, "html.parser")

Here is a sample of the HTML; I need to scrape the information between the br tags:

  • (021)801 97 10
  • (079)212 74 01
  • am.gachet@bluewin.ch

<table width="100%" border="0" cellpadding="0" cellspacing="0">
      <tbody>
        <tr>
          <td valign="top">
            <img src="/images/spacer.gif" width="1" height="5" border="0" alt="">
            <span class="titreentreprise">A. GACHET PEINTURE</span>
            <br>
            <span class="entrepriseDef">Cette entreprise est membre de la FVE&nbsp;&nbsp;</span>             
            <br>
            <br>
            RTE D'YVERDON 1
            <br>
            1028 PREVERENGES
            <br>
            Tél&nbsp;:&nbsp;(021) 801 97 10
            <br>
            Natel&nbsp;:&nbsp;(079) 212 74 01
            <br>
            e-Mail&nbsp;:&nbsp;<a href="mailto:am.gachet@bluewin.ch">am.gachet@bluewin.ch</a>               <br>
            <br>
            Statut&nbsp;:&nbsp;RAISON INDIVIDUELLE
            <br>
            Date de fondation&nbsp;:&nbsp;01.01.1992
            <br>
  	      </td>
        </tr>
    </tbody>
</table>

It would be really great if someone could help me!

Thanks =)

1 answer:

Answer 0 (score: 2)

Assuming every page has a span tag with the class "entrepriseDef", I would first extract everything into a clean list, then pull out the data and put it in a dict:

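#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

url = "http://www.fveconstruction.ch/anDetails.asp?RT=2&M=06&R=4&ID=10003401"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# First, put the data in a clean list
contact_name = soup.find("span", class_="titreentreprise").get_text(strip=True)
marker = soup.find("span", class_="entrepriseDef")
# Everything after the marker span inside the same parent, as raw HTML
# (this assumes the layout shown in the question's sample)
contact_info_raw = "".join(str(node) for node in marker.next_siblings)
# Beautiful Soup serializes <br> as <br/>, so normalize before splitting
contact_info = contact_info_raw.replace("<br/>", "<br>").split("<br>")
contact_info = [line.strip() for line in contact_info if line.strip()]

# Now, parse the list to get the actual data
address_lines = []
contact_data = {}
for line in contact_info:
    # "&nbsp;:&nbsp;" is decoded to U+00A0 ":" U+00A0 by the parser
    key, sep, value = line.partition('\xa0:\xa0')
    if sep:
        contact_data[key] = value
    else:
        # Lines without a "label : value" separator belong to the address
        address_lines.append(line)

contact_data["address"] = " ".join(address_lines)
# The e-mail value is still an <a> element; keep only its text
email = BeautifulSoup(contact_data['e-Mail'], 'html.parser')
contact_data['e-Mail'] = email.find('a').string

print(contact_name, contact_data)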

This gives me:

(u'A.BUACHE & FILS SARL', {'Fax': '(026) 660 23 56', 'Statut': 'SOCIETE A RESPONSABILITE LIMITEE', 'Natel': '(079) 376 34 45', 'e-Mail': 'tbuache@bluewin.ch', 'T\xc3\xa9l': '(026) 660 68 21', 'address': ' CASE POSTALE 31 RTE DE MALADAIRE 9 1562 CORCELLES-PAYERNE', 'Date de fondation': '04.12.2008'})
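
As a variation on the same idea, here is a minimal sketch that never converts the markup back to a string: it walks the siblings that follow the "entrepriseDef" span and starts a new line at every br tag. It runs offline on a trimmed copy of the sample HTML from the question. The names SAMPLE_HTML, lines_between_br and parse_contact are made up for this illustration, and the real pages may be laid out differently from the sample, so treat it as a starting point rather than a finished scraper.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the sample HTML from the question (no network access needed).
SAMPLE_HTML = """
<td valign="top">
  <span class="titreentreprise">A. GACHET PEINTURE</span>
  <br>
  <span class="entrepriseDef">Cette entreprise est membre de la FVE&nbsp;&nbsp;</span>
  <br>
  <br>
  RTE D'YVERDON 1
  <br>
  1028 PREVERENGES
  <br>
  Tél&nbsp;:&nbsp;(021) 801 97 10
  <br>
  Natel&nbsp;:&nbsp;(079) 212 74 01
  <br>
  e-Mail&nbsp;:&nbsp;<a href="mailto:am.gachet@bluewin.ch">am.gachet@bluewin.ch</a>
  <br>
  <br>
  Statut&nbsp;:&nbsp;RAISON INDIVIDUELLE
  <br>
  Date de fondation&nbsp;:&nbsp;01.01.1992
  <br>
</td>
"""


def lines_between_br(marker):
    """Return the text following `marker`, one string per <br>-delimited line."""
    lines, current = [], []
    for node in marker.next_siblings:
        if isinstance(node, Tag) and node.name == "br":
            # A <br> closes the current line; empty lines are skipped.
            text = "".join(current).strip()
            if text:
                lines.append(text)
            current = []
        elif isinstance(node, Tag):
            # Any other tag (for example the <a> around the e-mail): keep its text.
            current.append(node.get_text())
        else:
            # Plain text node.
            current.append(str(node))
    # Keep any text that follows the last <br>.
    tail = "".join(current).strip()
    if tail:
        lines.append(tail)
    return lines


def parse_contact(html):
    """Build a dict of labelled fields plus an "address" entry."""
    soup = BeautifulSoup(html, "html.parser")
    marker = soup.find("span", class_="entrepriseDef")
    data, address = {}, []
    for line in lines_between_br(marker):
        # "&nbsp;:&nbsp;" is decoded to U+00A0 ":" U+00A0 by the parser.
        key, sep, value = line.partition("\xa0:\xa0")
        if sep:
            data[key] = value
        else:
            address.append(line)
    data["address"] = " ".join(address)
    return data


contact = parse_contact(SAMPLE_HTML)
print(contact["Tél"], contact["Natel"], contact["e-Mail"])
```

Because the link text is collected while walking the siblings, the e-mail field does not need a second parsing pass. To use it on the live site, pass response.text to parse_contact instead of SAMPLE_HTML.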