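How to get text using Beautiful Soup

Time: 2020-10-08 01:45:31

Tags: python url text beautifulsoup get

I want to get the email address from the following page: http://www.ceice.gva.es/abc/i_guiadecentros/es/centro.asp?codi=46000110

The following code works when I paste the HTML in by hand: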

from bs4 import BeautifulSoup
html = '''<tr>
          <td bgcolor="#EBEBEB"><div align="right"><span class="Estilo1">E-Correo:</span></div></td>
          <td bgcolor="#F4F4F4"><span class="Estilo1">secretaria@mjosefacampos.com</span></td>
          </tr>
'''
soup = BeautifulSoup(html, 'lxml')
data = [item.text.strip() for item in soup.select('[bgcolor="#F4F4F4"]')]
print(data)

Output:

['secretaria@mjosefacampos.com']

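The problem is that I want to fetch it from the URL directly. I don't want to hard-code the HTML.

Thanks!

2 answers:

Answer 0: (score: 0)

Here is code that gets the same data from a URL: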

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    url = 'http://www.myurl.com'  # placeholder: replace with the page to scrape
    r = s.get(url)
    # r.text is a property, not a method, so it must not be called
    soup = BeautifulSoup(r.text, 'lxml')
    # collect the stripped text of every element with this background color
    data = [item.text.strip() for item in soup.select('[bgcolor="#F4F4F4"]')]
    print(data)
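
The same fetch-then-parse flow can be split into two small functions, which makes the parsing step testable without a network call. This is only a sketch: `fetch_html` and `extract_cells` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and `html.parser` is used here so the example does not depend on lxml being installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, headers=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text  # .text is a property, not a method


def extract_cells(html, selector='[bgcolor="#F4F4F4"]'):
    """Return the stripped text of every element matching the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [item.get_text(strip=True) for item in soup.select(selector)]
```

Calling `extract_cells(fetch_html(url))` with the page's URL would then run both steps. Note that the selector matches every cell with that background color, so on a full page the list may contain more than the email address.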

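Answer 1: (score: 0)

  1. You need to add a User-Agent header to the request.
  2. Your CSS selector is not specific enough: it has to target the row labelled "E-Correo:".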

import requests
from bs4 import BeautifulSoup

URL = "http://www.ceice.gva.es/abc/i_guiadecentros/es/centro.asp?codi=46000110"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"
}
soup = BeautifulSoup(requests.get(URL, headers=HEADERS).content, "lxml")

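# Pick the row whose text contains 'E-Correo:' and read the span in its value cell.
# Newer Soup Sieve releases deprecate :contains in favor of :-soup-contains.
data = soup.select_one(".nivelCentro tr:contains('E-Correo:') [bgcolor='#F4F4F4'] span").text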

print(data)

Output:

secretaria@mjosefacampos.com
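
As an alternative to matching on the background color, the value can be located through its label: find the text node "E-Correo:", walk up to its table cell, and read the next cell in the row. The sketch below runs offline against the row quoted in the question and assumes the live page uses the same row structure; `SAMPLE_ROW` and `value_for_label` are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Row copied from the question, wrapped in a <table> so it parses as a table row.
SAMPLE_ROW = """<table><tr>
  <td bgcolor="#EBEBEB"><div align="right"><span class="Estilo1">E-Correo:</span></div></td>
  <td bgcolor="#F4F4F4"><span class="Estilo1">secretaria@mjosefacampos.com</span></td>
</tr></table>"""


def value_for_label(html, label):
    """Return the text of the cell that follows the cell containing `label`, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the text node whose stripped content equals the label exactly.
    label_node = soup.find(string=lambda s: s is not None and s.strip() == label)
    if label_node is None:
        return None
    label_cell = label_node.find_parent("td")
    if label_cell is None:
        return None
    # The value sits in the next <td> of the same row.
    value_cell = label_cell.find_next_sibling("td")
    return value_cell.get_text(strip=True) if value_cell else None


print(value_for_label(SAMPLE_ROW, "E-Correo:"))
```

This approach keeps working if the page's colors change, but it breaks if the label text itself is reworded, so which selector is more robust depends on the page.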