Question

I'm new to web scraping, and I just want to scrape all the text content of a homepage.

Here is my code, but it isn't working properly.

from bs4 import BeautifulSoup
import requests


website_url = "http://www.traiteurcheminfaisant.com/"
ra = requests.get(website_url)
soup = BeautifulSoup(ra.text, "html.parser")

full_text = soup.find_all()

print(full_text)

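When I Ctrl+F for "traiteurcheminfaisant@hotmail.com" on the homepage, the email address shows up (in the footer). But when I print full_text, I get a lot of HTML content, yet not all of it: the address can't be found in full_text.

Thanks for your help!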

Answer 1

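A quick look at the website you're scraping makes me suspect that not all of the content is loaded when you send a simple GET request through the requests module. In other words, it seems that some components of the site (such as the footer you mentioned) are loaded asynchronously with JavaScript.

If that's the case, you will probably need some kind of automation tool to navigate to the page, wait for it to load, and then parse the fully loaded source. The most commonly used tool for this is Selenium. The first-time setup can be a little tricky, because you also need to install a separate webdriver for whichever browser you want to use. That said, the last time I set it up it was quite easy. Here is a rough example of what this might look like in your case (once Selenium is set up properly):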

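from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

import time

# Selenium 4 takes the driver path through a Service object;
# the older executable_path keyword is no longer accepted.
service = Service('/your/path/to/geckodriver')
driver = webdriver.Firefox(service=service)
try:
    driver.get('http://www.traiteurcheminfaisant.com')
    time.sleep(2)  # crude pause so JavaScript-rendered parts can load
    source = driver.page_source
finally:
    driver.quit()  # always close the browser, even on errors

soup = BeautifulSoup(source, 'html.parser')

full_text = soup.find_all()

print(full_text)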
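
One side note: find_all() with no arguments returns every tag object, so printing it shows markup rather than just the text. If the goal is only the readable text, something like the following may be closer to what you want. This is a minimal sketch: visible_text is a hypothetical helper name, the sample HTML and its example.com address are made up for illustration, and in the example above you would pass in source instead.

```python
from bs4 import BeautifulSoup


def visible_text(html: str) -> str:
    """Return the human-readable text of an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are code or styling, not page text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # One text fragment per line, with surrounding whitespace stripped.
    return soup.get_text(separator="\n", strip=True)


# Made-up sample page, used only to demonstrate the helper.
sample = (
    "<html><body><h1>Menu</h1>"
    "<script>var x = 1;</script>"
    "<footer>contact@example.com</footer></body></html>"
)
print(visible_text(sample))
```

After that, a plain substring check such as "traiteurcheminfaisant@hotmail.com" in visible_text(source) should tell you whether the footer made it into the rendered page.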

Answer 2

I haven't used BeautifulSoup before, but try using urlopen instead. It stores the web page as a string, which you can then search for the email address.

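from urllib.error import URLError
from urllib.request import urlopen

try:
    response = urlopen("http://www.traiteurcheminfaisant.com")
    html = response.read().decode(encoding="utf-8", errors="ignore")
    # str.find returns the index of the first match, or -1 if absent.
    print(html.find("traiteurcheminfaisant@hotmail.com"))
except URLError:  # also covers HTTPError, which is a subclass
    print("Cannot open webpage")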
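
Note that find only gives you the position of one known string. If you don't know the address in advance, a regular expression over the same html string can list whatever looks like an email address. This is a rough sketch under some assumptions: find_emails and EMAIL_RE are names I made up, the pattern is deliberately simple rather than a complete validator, and the sample markup uses invented example.com addresses.

```python
import re
from typing import List

# Deliberately simple pattern: good enough for spotting addresses in a page,
# not a full validator for every legal email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html: str) -> List[str]:
    """Return the unique email-like strings in html, in order of first appearance."""
    # dict.fromkeys removes duplicates while keeping insertion order.
    return list(dict.fromkeys(EMAIL_RE.findall(html)))


# Made-up markup, used only to demonstrate the helper.
sample = (
    '<footer><a href="mailto:info@example.com">info@example.com</a> '
    "sales@example.org</footer>"
)
print(find_emails(sample))
```

Keep in mind this only sees what is in the string you pass in, so if the footer is injected by JavaScript, the html returned by urlopen may still not contain the address.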

How do I scrape all the text content of a website's homepage?