I want to get the email address from the following page: http://www.ceice.gva.es/abc/i_guiadecentros/es/centro.asp?codi=46000110
I have had success with the following code:
from bs4 import BeautifulSoup
html = '''<tr>
<td bgcolor="#EBEBEB"><div align="right"><span class="Estilo1">E-Correo:</span></div></td>
<td bgcolor="#F4F4F4"><span class="Estilo1">secretaria@mjosefacampos.com</span></td>
</tr>
'''
soup = BeautifulSoup(html, 'lxml')
data = [item.text.strip() for item in soup.select('[bgcolor="#F4F4F4"]')]
print(data)
Output:
['secretaria@mjosefacampos.com']
The problem is that I want to get it from the URL. I don't want to use a pasted HTML string.
Thanks!
Answer 0 (score: 0)
Here is the same code, adapted to fetch the page from a URL:
import requests
from bs4 import BeautifulSoup

# Placeholder URL from the original answer; substitute the page you want to scrape.
url = 'http://www.myurl.com'

with requests.Session() as s:
    r = s.get(url)
    # r.text is a property, not a method, so it takes no parentheses.
    soup = BeautifulSoup(r.text, 'lxml')

# Collect the text of every element whose bgcolor attribute is #F4F4F4.
data = [item.text.strip() for item in soup.select('[bgcolor="#F4F4F4"]')]
print(data)
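As a possible refinement of the answer above, here is a minimal sketch that separates downloading from parsing, so the parsing step can be checked against the HTML snippet from the question without touching the network. The helper names `fetch_html` and `extract_cells`, the 10-second timeout, and the use of the built-in `html.parser` instead of `lxml` are assumptions made for illustration; they are not part of the original answer.

```python
import requests
from bs4 import BeautifulSoup

# HTML snippet copied from the question, used here only to exercise the parser.
SAMPLE_HTML = '''<tr>
<td bgcolor="#EBEBEB"><div align="right"><span class="Estilo1">E-Correo:</span></div></td>
<td bgcolor="#F4F4F4"><span class="Estilo1">secretaria@mjosefacampos.com</span></td>
</tr>
'''


def fetch_html(url, timeout=10):
    """Download a page and return its decoded text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


def extract_cells(html, selector='[bgcolor="#F4F4F4"]'):
    """Return the stripped text of every element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [item.get_text(strip=True) for item in soup.select(selector)]


# Parsing the sample needs no network access.
print(extract_cells(SAMPLE_HTML))
```

Against the live page you would call `extract_cells(fetch_html(url))`. Note that the selector matches every cell with that background color, so the result may well contain more than the email address.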
Answer 1 (score: 0)
Specify a User-Agent header:
import requests
from bs4 import BeautifulSoup
URL = "http://www.ceice.gva.es/abc/i_guiadecentros/es/centro.asp?codi=46000110"
HEADERS = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}
soup = BeautifulSoup(requests.get(URL, headers=HEADERS).content, "lxml")
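# :-soup-contains is the current spelling of the deprecated :contains selector (soupsieve 2.1+).
data = soup.select_one(".nivelCentro tr:-soup-contains('E-Correo:') [bgcolor='#F4F4F4'] span").text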
print(data)
Output:
secretaria@mjosefacampos.com
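If the page's colors or class names change, the selector in this answer would stop matching. One alternative, sketched below under assumptions, is to find the row by its label text and return the cell next to it. The function name `value_for_label` is hypothetical, the sample is the snippet from the question, and the built-in `html.parser` is used so the sketch does not depend on `lxml`; the layout of the live page is assumed to match that snippet.

```python
from bs4 import BeautifulSoup

# HTML snippet copied from the question.
SAMPLE_HTML = '''<tr>
<td bgcolor="#EBEBEB"><div align="right"><span class="Estilo1">E-Correo:</span></div></td>
<td bgcolor="#F4F4F4"><span class="Estilo1">secretaria@mjosefacampos.com</span></td>
</tr>
'''


def value_for_label(html, label):
    """Return the text of the cell that follows the cell containing `label`.

    Returns None when no table row starts with that label.
    """
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # A matching row has the label in its first cell and the value in its second.
        if len(cells) >= 2 and cells[0].get_text(strip=True) == label:
            return cells[1].get_text(strip=True)
    return None


print(value_for_label(SAMPLE_HTML, "E-Correo:"))  # secretaria@mjosefacampos.com
```

The trade-off is that this relies on the exact label text instead of on styling attributes, so it breaks if the label wording changes.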