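I'm trying to use Selenium's Python bindings to scrape a website hosted on Squarespace that contains a web serial. I'm copying the contents of a div into a text file. This is the code I'm using: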
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
i = 1
while i < 250:
    driver.get("http://www.drewhayesnovels.com/spy4/{0}".format(i))
    content = driver.find_element_by_css_selector('p') #Get content
    content_text = content.get_attribute('innerHTML')
    file = open("output/ch{0}.txt".format(i), 'w')
    file.write(str(content_text))
    file.close()
    print(i)
    i = i + 1
print("Complete")
For some reason, when it reaches the third chapter it can no longer find the element. So I wrote this small test:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Opera()
i = 1
while i < 250:
    driver.get("http://www.drewhayesnovels.com/spy4/{0}".format(i))
    try:
        content = driver.find_element_by_css_selector('p') #Get content
        content_text = content.get_attribute('innerHTML')
        file = open("output/ch{0}.txt".format(i), 'w')
        file.write(str(content_text))
        file.close()
        print(i)
        i = i + 1
    except:
        print("Error, keep running.")
        i = i + 1
print("Complete")
With this test, only 5 or 6 seemingly random pages succeed. I tried it with another browser (Opera) and it still doesn't work. Any ideas?
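One thing that might help narrow this down: the bare `except:` in the test hides which exception is actually raised. It is also possible (an assumption, not something confirmed for this site) that the page fills in its content after `driver.get` returns, so the `p` element does not exist yet at lookup time. Below is a minimal sketch of a retry helper in plain Python. The name `poll_until` and its parameters are hypothetical, not part of Selenium. It retries a lookup for a few seconds and, if it never succeeds, re-raises the last exception so the real error becomes visible.

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it succeeds or `timeout` seconds pass.

    Returns the result of fetch(). If the deadline passes while fetch()
    is still failing, the last exception is re-raised unchanged.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return fetch()
        except Exception:
            if time.monotonic() >= deadline:
                raise  # surface the real error instead of hiding it
            time.sleep(interval)


# Hypothetical usage inside the loop above (needs a live `driver`):
# content = poll_until(lambda: driver.find_element_by_css_selector('p'))
```

If the lookup still fails after the timeout, the traceback at least shows the exception type, which should say whether the element is genuinely missing on those pages or something else is going wrong.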