I want to use Python to get some data from the pre tags of an HTML page.
I tried Selenium first, but it cannot find the element by XPath.
browser = webdriver.Ie()
wait = WebDriverWait(browser, 5)
browser.get('file:\\\my_url.html')
body= wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/pre[2]")))
print(body.text)
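I then tried bs4. However, BeautifulSoup keeps telling me that my browser does not support the Frames extension. I am not familiar with bs4 and could not find a useful solution. Can anyone tell me how to change the IE browser settings so that the data can be read successfully? Thanks!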
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = " "  # this HTML page is on a network drive and can be opened by IE\Chrome\...
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
for script in soup(["script", "style"]):
    script.extract()  # rip it out
text = soup.get_text()
# strip leading and trailing whitespace from each line
lines = (line.strip() for line in text.splitlines())
# split on double spaces so that multi-part headlines land on separate lines
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
>>>This page is designed to be viewed by a browser which supports Frames extension.
This text will be shown by browsers which do not support the Frames extension.
Answer 0 (score: 0)
Your pre element is inside a <frame> named "glhstry_main", so you need to switch to that frame before you can access the element. Like this:
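from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

browser = webdriver.Ie()
wait = WebDriverWait(browser, 5)
browser.get('file:\\\my_url.html')
browser.switch_to.frame("glhstry_main")  # switch into the frame
body = wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/pre[2]")))
print(body.text)
# do your frame stuff
browser.switch_to.default_content()  # switch back to the top-level document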
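
The same frame issue would explain the BeautifulSoup output in the question: the page fetched with urlopen is most likely only the outer frameset, whose fallback text is what get_text() printed, while the pre elements live in the separate document that the frame's src points to. Changing IE settings would not affect this, since urlopen does not go through a browser. Below is a minimal sketch of the idea that uses only the standard library so it runs without extra packages. The frameset markup, the "glhstry_menu" frame, the file names menu.html and main.html, the base URL, and the class name FrameAndPreParser are all assumptions for illustration; only the frame name "glhstry_main" comes from the answer above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameAndPreParser(HTMLParser):
    """Collect <frame> name/src pairs and the text of every <pre> element."""

    def __init__(self):
        super().__init__()
        self.frames = {}     # frame name -> src attribute
        self.pre_texts = []  # text of each <pre>, in document order
        self._in_pre = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "frame" and "src" in attrs:
            self.frames[attrs.get("name", "")] = attrs["src"]
        elif tag == "pre":
            self._in_pre = True
            self.pre_texts.append("")

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.pre_texts[-1] += data


def parse(html):
    """Feed an HTML string to a fresh parser and return the parser."""
    parser = FrameAndPreParser()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical stand-ins for the real files on the network drive.
FRAMESET_HTML = """
<html><frameset cols="20%,80%">
  <frame name="glhstry_menu" src="menu.html">
  <frame name="glhstry_main" src="main.html">
  <noframes>This page is designed to be viewed by a browser which supports Frames extension.</noframes>
</frameset></html>
"""
FRAME_HTML = "<html><body><pre>header</pre><pre>row 1\nrow 2</pre></body></html>"

outer = parse(FRAMESET_HTML)
# Resolve the frame's src against the (hypothetical) URL of the outer page.
frame_url = urljoin("file:///share/my_url.html", outer.frames["glhstry_main"])

# In real use, fetch frame_url (for example with urlopen) instead of FRAME_HTML.
inner = parse(FRAME_HTML)
second_pre = inner.pre_texts[1]  # the second <pre>, like /html/body/pre[2]
print(frame_url)
print(second_pre)
```

With BeautifulSoup the lookup would be similar in spirit: find the frame tag whose name is "glhstry_main", read its src attribute, fetch that URL, and then call find_all("pre") on the second document.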