I'm using Selenium with Python for web scraping, and I'm extracting parts of a website with XPath.
I'd like to know how to loop over a list of URLs, extract from each one, and save the results into a dictionary.
mylist_URLs = ['https://www.sec.gov/cgi-bin/own-disp? action=getowner&CIK=0001560258',
'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000034088',
'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0001048911']
My code below only works for a single URL...
from selenium import webdriver

# open the browser
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
# load a single page
driver.get('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0000104169')
# grab the innerHTML of the first matching table cell
driver.find_elements_by_xpath('/html/body/div/table[1]/tbody/tr[2]/td/table/tbody/tr[1]/td')[0].get_attribute('innerHTML')
Thanks for your help.
Answer (score: 1)
You can use a simple WebDriverWait in each iteration of the loop to make sure the table has loaded before getting the innerHTML.
Add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Script:
mylist_URLs = ['https://www.sec.gov/cgi-bin/own-disp? action=getowner&CIK=0001560258',
'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000034088',
'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0001048911']
# open the browser
driver = webdriver.Chrome(r'xxx\chromedriver.exe')
# iterate through all the urls
for url in mylist_URLs:
    print(url)
    driver.get(url)
    # wait for the table cell to be present
    element = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, "(//table[1]/tbody/tr[2]/td/table/tbody/tr[1]/td)[1]")))
    # now get the element's innerHTML
    print(element.get_attribute('innerHTML'))
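Since the question also asks about saving the results into a dictionary, here is a minimal self-contained sketch of one way to do that, keying each scraped innerHTML by its URL. This is not part of the original answer; the results dict and the try/finally around driver.quit() are my own additions, and it assumes the same XPath matches on every page:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

mylist_URLs = ['https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001560258',
               'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0000034088',
               'https://www.sec.gov/cgi-bin/own-disp?action=getissuer&CIK=0001048911']

driver = webdriver.Chrome(r'xxx\chromedriver.exe')
results = {}  # hypothetical name: maps each URL to the innerHTML scraped from that page
try:
    for url in mylist_URLs:
        driver.get(url)
        # wait up to 30 seconds for the target table cell to load
        element = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located(
                (By.XPATH, "(//table[1]/tbody/tr[2]/td/table/tbody/tr[1]/td)[1]")))
        # store the cell's innerHTML under its URL
        results[url] = element.get_attribute('innerHTML')
finally:
    driver.quit()

print(results)

Keying by URL keeps the mapping between each CIK page and what was scraped from it, which is handy if any page fails or needs re-checking.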