Question

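I'm trying to find the donate button on the University of British Columbia website.

The donate button is in the footer, in a div with the class "span7".

However, when I scrape the page, that div comes back with nothing inside it.

My program works perfectly when the div is given directly as the source: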

m_Options = new ChromeOptions();
m_Options.AddArgument("--user-data-dir=C:/Users/Me/AppData/Local/Google/Chrome/User Data");
m_Options.AddArgument("--profile-directory=Default");
m_Options.AddArgument("--disable-extensions");
m_Driver = new ChromeDriver(@"pathtoexe", m_Options);
m_Driver.Navigate().GoToUrl("somesite");

However, it does not work when I use the website itself:

from bs4 import BeautifulSoup as bs
import re

site = '''<div class="span7" id="ubc7-footer-menu"><div class="row-fluid"><div class="span6"><h3>About UBC</h3><div><a href="https://cdn.ubc.ca/clf/ref/contact">Contact UBC</a></div><div><a href="https://cdn.ubc.ca/clf/ref/about">About the University</a></div><div><a href="https://cdn.ubc.ca/clf/ref/news">News</a></div><div><a href="https://cdn.ubc.ca/clf/ref/events">Events</a></div><div><a href="https://cdn.ubc.ca/clf/ref/careers">Careers</a></div><div><a href="https://cdn.ubc.ca/clf/ref/gift">Make a Gift</a></div><div><a href="https://cdn.ubc.ca/clf/ref/search">Search UBC.ca</a></div></div><div class="span6"><h3>UBC Campuses</h3><div><a href="https://cdn.ubc.ca/clf/ref/vancouver">Vancouver Campus</a></div><div><a href="https://cdn.ubc.ca/clf/ref/okanagan">Okanagan Campus</a></div><h4>UBC Sites</h4><div><a href="https://cdn.ubc.ca/clf/ref/robson">Robson Square</a></div><div><a href="https://cdn.ubc.ca/clf/ref/centre-for-digital-media">Centre for Digital Media</a></div><div><a href="https://cdn.ubc.ca/clf/ref/medicine">Faculty of Medicine Across BC</a></div><div><a href="https://cdn.ubc.ca/clf/ref/asia">Asia Pacific Regional Office</a></div></div></div></'''

html = bs(site, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))

#returns proper donation URL

Is there something wrong with my parser? Is this some kind of anti-scraping strategy? Am I doomed?

Answer 1

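I can't seem to find a "Donate" button at the URL you provided, but there is nothing inherently wrong with your parser. The GET request you send only gives you the HTML that comes back in the initial response; it does not wait for the page to finish rendering.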
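To illustrate the point about the parser, here is a minimal sketch that runs the same search over two inputs: a footer div that already contains the link, and an empty one like the scraper reportedly receives. It uses the standard library's html.parser instead of BeautifulSoup only to stay dependency-free; the names LinkFinder and find_gift_link and both sample strings are made up for this example.

```python
import re
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_gift_link(html):
    """Return the href of the first link whose text mentions a donation, else None."""
    finder = LinkFinder()
    finder.feed(html)
    pattern = re.compile(r"donate|donation|gift", re.IGNORECASE)
    for href, text in finder.links:
        if pattern.search(text):
            return href
    return None


# Hypothetical sample: the footer after the browser has filled it in.
rendered = ('<div class="span7" id="ubc7-footer-menu"><div>'
            '<a href="https://cdn.ubc.ca/clf/ref/gift">Make a Gift</a>'
            '</div></div>')
# Hypothetical sample: the same div as it might appear in the initial response.
initial = '<div class="span7" id="ubc7-footer-menu"></div>'

print(find_gift_link(rendered))  # https://cdn.ubc.ca/clf/ref/gift
print(find_gift_link(initial))   # None
```

The search logic is identical in both calls; only the input differs, which is why the fix belongs in how the HTML is fetched rather than in how it is parsed.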

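It looks like parts of the page are populated by JavaScript. You can use Splash, which is built for rendering JavaScript-based pages. Splash is very easy to run in Docker: you just make HTTP requests to the Splash container, and it returns HTML that looks the same as the page rendered in a web browser.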

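This may sound overly complicated, but it is actually very simple to set up, because you don't need to modify the Docker image at all and you don't need any prior Docker knowledge to get it working. A single line on the command line starts a local Splash server: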
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash

Then you just modify any existing requests in your Python code so that they go through Splash:

i.e. http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
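As a rough sketch of that change in Python, assuming the Splash container from the docker command above is listening on localhost:8050: the helper names splash_render_url and fetch_rendered_html are made up for this example, and the 2-second wait is a guess that may need tuning for a given page. The url and wait query parameters are part of Splash's render.html endpoint.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash was started locally with the docker command above.
SPLASH = "http://localhost:8050"


def splash_render_url(target_url, wait=2.0, splash=SPLASH):
    """Build the render.html URL that asks Splash to load and render target_url.

    `wait` is the number of seconds Splash waits after the page loads,
    giving the page's JavaScript time to fill in content.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash}/render.html?{query}"


def fetch_rendered_html(target_url, wait=2.0):
    """Fetch target_url through Splash and return the rendered HTML as text."""
    with urlopen(splash_render_url(target_url, wait), timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The string returned by fetch_rendered_html can then be handed to the existing BeautifulSoup code in place of the raw response body. The target URL is percent-encoded here so that any query string it carries is not mistaken for Splash's own parameters.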

Website hiding page footer from parser
