Question

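I'm trying to scrape the Target website for product details such as prices, names, and JPEG images, but the content I extract with BeautifulSoup in Python doesn't seem to match the site's HTML as shown in the browser's developer tools (F12).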

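I tried both html.parser and lxml as the parser argument to BeautifulSoup, but it makes no difference. I searched Google for similar problems and found nothing relevant. I run my Python code from Atom on Ubuntu 18.04.2. I'm fairly new to Python, though I have programmed before.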

from requests import get
from bs4 import BeautifulSoup

url = 'https://www.target.com/s?searchTerm=dove'
# Gets html from the given url
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
items = html_soup.find_all('li', class_ = 'bkaxin')
print(len(items))

The expected output is 28, but I always get 0.

Answer 1

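The elements you are looking for don't appear in the HTML that the server returns, because they are created dynamically by JavaScript after the page loads. You can check this yourself by viewing the page source as it is first loaded (View Source, not the F12 element inspector, which shows the live DOM). You can also print html_soup.prettify() and you will see that the elements you are looking for are not there.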
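To make this concrete, here is a minimal offline sketch. The HTML string is a made-up stand-in for a server response (it is not Target's actual markup): the list is empty in the served HTML and a script fills it in later. BeautifulSoup only parses the text it is given and never runs the script, so it finds nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw server response: the <ul> is empty,
# and the <li> elements would only exist after a browser runs the script.
raw_html = """
<html><body>
<ul id="results"></ul>
<script>
  const ul = document.getElementById("results");
  for (let i = 0; i < 3; i++) {
    const li = document.createElement("li");
    li.className = "bkaXIn";
    ul.appendChild(li);
  }
</script>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# BeautifulSoup does not execute JavaScript, so no <li> elements are found.
static_items = soup.find_all("li", class_="bkaXIn")
print(len(static_items))

# The script is still there, but only as plain text inside a <script> tag.
script_text = soup.find("script").string
print("createElement" in script_text)
```

The same check works on a real response: if the class name you see in F12 does not occur anywhere in response.text, no choice of parser will recover it.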

Inspired by this question, I came up with a Selenium-based solution:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.target.com/s?searchTerm=dove"
# Start a real browser so that the page's JavaScript actually runs
driver = webdriver.Firefox()

driver.get(url)
# page_source is the browser's current DOM, including elements added by scripts so far
html = driver.page_source
html_soup = BeautifulSoup(html, 'html.parser')
items = html_soup.find_all('li', class_ = 'bkaXIn')
driver.close()

print(len(items))

When I run it, the code above prints 28.

Note that you need to install Selenium (installation guide here) and a suitable driver for this to work (in my solution I used the Firefox driver, which can be downloaded here).

Also note that I passed class_ = 'bkaXIn' to html_soup.find_all, not 'bkaxin' as in your code (class names are case-sensitive!).
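A small sketch of the case-sensitivity point, using a made-up HTML fragment rather than the real page. The case-insensitive variant with re.compile is my own addition for illustration; it is not needed for the solution above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment with two list items carrying the mixed-case class name.
html = (
    '<ul>'
    '<li class="bkaXIn">Dove soap</li>'
    '<li class="bkaXIn">Dove shampoo</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

# A string passed to class_ must match the class name exactly, including case.
exact = soup.find_all("li", class_="bkaXIn")
lowercase = soup.find_all("li", class_="bkaxin")

# A compiled regular expression can be passed instead if case should be ignored.
ignore_case = soup.find_all("li", class_=re.compile(r"^bkaxin$", re.IGNORECASE))

print(len(exact), len(lowercase), len(ignore_case))
```

So even with the fully rendered HTML from Selenium, the all-lowercase 'bkaxin' from the question would still return 0 matches.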
