I'm trying to scrape a very odd website that uses lots of br tags and has no classes or ids on the data I need. I'm using Beautiful Soup:
import requests
from bs4 import BeautifulSoup

url = "http://www.fveconstruction.ch/anDetails.asp?RT=2&M=06&R=4&ID=10003401"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, "html.parser")
Here is a sample of the HTML. I need to scrape the information between the br tags:
<table width="100%" border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td valign="top">
<img src="/images/spacer.gif" width="1" height="5" border="0" alt="">
<span class="titreentreprise">A. GACHET PEINTURE</span>
<br>
<span class="entrepriseDef">Cette entreprise est membre de la FVE </span>
<br>
<br>
RTE D'YVERDON 1
<br>
1028 PREVERENGES
<br>
Tél : (021) 801 97 10
<br>
Natel : (079) 212 74 01
<br>
e-Mail : <a href="mailto:am.gachet@bluewin.ch">am.gachet@bluewin.ch</a> <br>
<br>
Statut : RAISON INDIVIDUELLE
<br>
Date de fondation : 01.01.1992
<br>
</td>
</tr>
</tbody>
</table>
If anyone could help, that would be really great!
Thanks =)
Answer 0 (score: 2)
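Assuming every page has a span tag with the class "entrepriseDef", I would extract everything that follows it into a clean list, then pull the data out of that list and put it in a dict: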
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup, Tag

url = "http://www.fveconstruction.ch/anDetails.asp?RT=2&M=06&R=4&ID=10003401"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# First, put the data in a clean list: one entry per <br>-separated line
contact_name = soup.find("span", class_="titreentreprise").string
marker = soup.find("span", class_="entrepriseDef")
contact_info = []
current = ""
for node in marker.next_siblings:
    if isinstance(node, Tag) and node.name == "br":
        # A <br> ends the current line; keep it only if it holds text
        if current.strip():
            contact_info.append(current.strip())
        current = ""
    elif isinstance(node, Tag):
        # Inline tag such as the e-mail <a>: keep its text only
        current += node.get_text()
    else:
        current += str(node)
if current.strip():
    contact_info.append(current.strip())

# Now, parse the list to get the actual data
address = ""
contact_data = {}
for line in contact_info:
    # "key : value" lines contain a colon; anything else is an address line
    key, sep, value = line.partition(':')
    if sep:
        # strip() also removes the non-breaking spaces around the colon
        contact_data[key.strip()] = value.strip()
    else:
        address += ' ' + line
contact_data["address"] = address
print(contact_name, contact_data)
Running it gives me (output as printed under Python 2):
(u'A.BUACHE & FILS SARL', {'Fax': '(026) 660 23 56', 'Statut': 'SOCIETE A RESPONSABILITE LIMITEE', 'Natel': '(079) 376 34 45', 'e-Mail': 'tbuache@bluewin.ch', 'T\xc3\xa9l': '(026) 660 68 21', 'address': ' CASE POSTALE 31 RTE DE MALADAIRE 9 1562 CORCELLES-PAYERNE', 'Date de fondation': '04.12.2008'})
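To try the br-splitting idea without hitting the live site, here is a minimal offline sketch. It assumes the real pages follow the same layout as the sample in the question. The helper name `parse_contact` and the trimmed `SAMPLE_HTML` string are illustrative assumptions, not part of the original answer, and this is only one possible way to do it.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the sample markup from the question (assumption: the real
# pages follow the same layout).
SAMPLE_HTML = """
<td valign="top">
<span class="titreentreprise">A. GACHET PEINTURE</span>
<br>
<span class="entrepriseDef">Cette entreprise est membre de la FVE </span>
<br>
<br>
RTE D'YVERDON 1
<br>
1028 PREVERENGES
<br>
Tél : (021) 801 97 10
<br>
e-Mail : <a href="mailto:am.gachet@bluewin.ch">am.gachet@bluewin.ch</a> <br>
<br>
Statut : RAISON INDIVIDUELLE
<br>
</td>
"""


def parse_contact(html):
    """Hypothetical helper: return (name, data) for one company block."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("span", class_="titreentreprise").get_text(strip=True)
    marker = soup.find("span", class_="entrepriseDef")

    # Rebuild the <br>-separated lines that follow the marker span.
    lines, current = [], ""
    for node in marker.next_siblings:
        if isinstance(node, Tag) and node.name == "br":
            lines.append(current.strip())
            current = ""
        elif isinstance(node, Tag):
            # Inline tag such as <a>: keep only its visible text.
            current += node.get_text()
        else:
            current += str(node)
    lines.append(current.strip())

    # "key : value" lines go into the dict; the rest form the address.
    data, address = {}, []
    for line in filter(None, lines):
        key, sep, value = line.partition(":")
        if sep:
            data[key.strip()] = value.strip()
        else:
            address.append(line)
    data["address"] = " ".join(address)
    return name, data


name, data = parse_contact(SAMPLE_HTML)
print(name, data)
```

Passing `response.text` from a `requests.get` call instead of `SAMPLE_HTML` should behave the same way, as long as the page keeps this layout.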