Question

I'm trying to use this Python 2.7 script to extract a keyword/string from a website's source code:

from selenium import webdriver

keyword = ['googleadservices']

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
driver.get('https://www.vacatures.nl/')

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")

for searchstring in keyword:
    if searchstring.lower() in str(source_code).lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')

Fortunately, the browser opens when the script runs, but I can't extract the keyword I need from the source code. Any help?

Answer 1

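I checked, and googleadservices does not appear in the page's source code.

There is nothing wrong with your code.

I tried GoogleAnalyticsObject instead, and that one is found.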

from selenium import webdriver

keyword = ['googleadservices', 'GoogleAnalyticsObject']

driver = webdriver.Chrome()
driver.get('https://www.vacatures.nl/')

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")

for searchstring in keyword:
    if searchstring.lower() in str(source_code).lower():
        print (searchstring, 'found')
    else:
        print (searchstring, 'not found')

Also, instead of using //* to get the source code:

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")

use the following:

source_code = driver.page_source
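
As a rough sketch of the page_source approach, the check can be split into a pure helper and a browser step. The names find_keywords and fetch_page_source and the sample HTML string are illustrative assumptions, not part of the original answer, and fetch_page_source assumes chromedriver can be found on the PATH.

```python
def find_keywords(source_code, keywords):
    """Return a dict mapping each keyword to True/False (case-insensitive)."""
    haystack = source_code.lower()
    return {kw: kw.lower() in haystack for kw in keywords}


def fetch_page_source(url):
    """Load url in Chrome and return the page source as seen by the driver."""
    # Imported lazily so find_keywords can be used without a browser.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: chromedriver is on the PATH
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if get() fails


# Offline check against a hypothetical HTML snippet (not the real page):
sample = "<script>window['GoogleAnalyticsObject'] = 'ga';</script>"
results = find_keywords(sample, ["googleadservices", "GoogleAnalyticsObject"])
for kw, found in results.items():
    print("%s %s" % (kw, "found" if found else "not found"))
```

For a live run, something like find_keywords(fetch_page_source('https://www.vacatures.nl/'), keyword) should behave like the loop above.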

Answer 2

As others have said, the problem is not your code; googleadservices simply does not appear in the source code.

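What I would add is that your code is a bit over-engineered, since all you are doing is returning true or false depending on whether a string exists in the source code.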

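You can achieve this more easily with a better XPath, for example //script[contains(text(),'googletagmanager')], by calling find_element_by_xpath and catching the possible NoSuchElementException. That may save you time, and you no longer need the for loop. There are other possibilities as well, such as using find_elements_by_xpath, which returns a List, and then checking whether the length of the returned list is greater than 0.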
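
A minimal sketch of the list-based variant, under a few assumptions: the helper names script_contains and check_site are made up for illustration, the search text contains no single quote, and the generic find_elements(by, value) call is used in place of find_elements_by_xpath because newer Selenium releases dropped the find_element(s)_by_* helpers.

```python
def script_contains(driver, text):
    """Return True if any <script> element's text contains `text`.

    Assumes `text` has no single quote, since it is pasted into the XPath.
    """
    xpath = "//script[contains(text(),'%s')]" % text
    # "xpath" is the locator-strategy string that selenium exposes as By.XPATH.
    matches = driver.find_elements("xpath", xpath)
    # find_elements returns an empty list instead of raising when nothing matches,
    # so no NoSuchElementException handling is needed here.
    return len(matches) > 0


def check_site(url, text):
    """Open url in Chrome and report whether a script mentions `text`."""
    from selenium import webdriver  # lazy import: only needed for a real run

    driver = webdriver.Chrome()  # assumption: chromedriver is on the PATH
    try:
        driver.get(url)
        return script_contains(driver, text)
    finally:
        driver.quit()
```

Calling check_site('https://www.vacatures.nl/', 'googletagmanager') would then return True or False. Note that XPath's contains() is case-sensitive, unlike the lower() comparison in the question.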

Finding a string in the source code of a URL using Selenium WebDriver
