Question

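I have a problem and need some help. I am trying to scrape some numbers from a website (see the link in the code below). Because the page is loaded with JavaScript, I use Selenium to load the page first and then pass the source to lxml to parse the data.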

The code I am using is as follows:

from selenium import webdriver
from lxml import html
import time

url = "http://sebgroup.com/large-corporates-and-institutions/prospectuses-and-downloads/rates/swap-rates"
xpath = '//*[@id="doc"]/table[2]/tbody/tr[3]/text()'

chrome_path = "my_path"
browser = webdriver.Chrome(chrome_path)
browser.get(url)
time.sleep(10)

html_source = browser.page_source

tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print(text)

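When I view the page directly in my browser, I can see the numbers in the source. But when I do the same thing with Selenium, the source I get is different. Is this because the site uses some kind of anti-scraping software? Is there any way to get the data anyway? (I need it for academic purposes.)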

Answer 1

The table you want is located inside an iframe, so you need to switch to that iframe before getting the page source. Try the following:

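from selenium.webdriver.common.by import By

chrome_path = "my_path"
browser = webdriver.Chrome(chrome_path)
browser.get(url)
time.sleep(10)
# Switch into the iframe so that page_source returns the frame's document.
# find_element(By.TAG_NAME, ...) replaces find_element_by_tag_name, which was removed in Selenium 4.3.
browser.switch_to.frame(browser.find_element(By.TAG_NAME, "iframe"))
html_source = browser.page_source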
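
For reference, here is a minimal sketch of how the whole flow might fit together: switch into the iframe, then hand the frame's source to lxml. The helper names extract_rows and fetch_frame_source are hypothetical, and so are the assumptions that the table sits in the first iframe on the page and that Selenium can find chromedriver on its own. It also swaps the fixed time.sleep(10) for an explicit wait and reads the text of each cell, since text() on a tr only returns the text nodes directly under the row, not what is inside its td cells. Treat it as a starting point rather than a tested solution for this particular site.

```python
from lxml import html


def extract_rows(html_source, table_xpath="//table"):
    """Return the cell text of every row in the first table matching table_xpath."""
    tree = html.fromstring(html_source)
    tables = tree.xpath(table_xpath)
    if not tables:
        return []
    rows = []
    # ".//tr" matches rows whether or not the markup contains a <tbody>.
    for tr in tables[0].xpath(".//tr"):
        # text_content() also collects text from tags nested inside a cell.
        cells = [cell.text_content().strip() for cell in tr.xpath("./th | ./td")]
        if cells:
            rows.append(cells)
    return rows


def fetch_frame_source(url, timeout=10):
    """Load url in Chrome, switch into the first iframe and return its source."""
    # Imported here so extract_rows can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()  # assumption: chromedriver can be located automatically
    try:
        browser.get(url)
        # Wait until an iframe is present, then switch into it in one step.
        WebDriverWait(browser, timeout).until(
            EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe"))
        )
        return browser.page_source
    finally:
        browser.quit()
```

With these in place, something like rows = extract_rows(fetch_frame_source(url), '//*[@id="doc"]/table[2]') should give one list of strings per table row, provided the XPath from the question actually matches inside the frame.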
