Here is the problem: I am using PhantomJS and Selenium in Python to render pages. This is the code:
import sys, time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path_to_chromedriver = 'C:\\..\\chromedriver'
section = sys.argv[1]
path = sys.argv[2]
links = sys.argv[3]

listOfLinks = []
file = open(links, 'r')
for link in file:
    listOfLinks.append(link)

dr = webdriver.Chrome(executable_path = path_to_chromedriver)
cont = 0
for link in listOfLinks:
    try:
        dr.get(link)
        # Wait.
        element = WebDriverWait(dr, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "_img-zoom"))
        )
        time.sleep(1)
        htmlPath = path + section + "_" + str(cont) + ".html"
        # Write HTML.
        file = open(htmlPath, 'w')
        file.write(dr.page_source)
        file.close()
        cont = cont + 1
    except:
        print("Exception")
dr.quit()
This code saves the rendered HTML of each link in the file passed as an argument.
The resulting files are then parsed with Jsoup in Java:
Document document = Jsoup.parse( file, "UTF-8" );
However, special characters such as '€', 'á', 'é' and 'í' are not decoded correctly and are replaced by '?'. How can I solve this?
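One possible cause, assuming the script runs under Python 3: `open(htmlPath, 'w')` uses the platform's default encoding (often cp1252 on Windows), while Jsoup is told to read the file as UTF-8. A minimal sketch of writing the file with an explicit encoding follows. The helper name `save_html` and the sample string are hypothetical stand-ins, not part of the script above, and I have not confirmed that this is the actual cause.

```python
import os
import tempfile


def save_html(html_path, page_source):
    """Write page_source as UTF-8 instead of the platform default encoding.

    Hypothetical helper: in the script above, page_source would be
    dr.page_source and html_path would be htmlPath.
    """
    with open(html_path, "w", encoding="utf-8") as html_file:
        html_file.write(page_source)


# Stand-in for dr.page_source, containing the characters from the question.
sample = "<p>€ á é í</p>"
out_path = os.path.join(tempfile.mkdtemp(), "section_0.html")
save_html(out_path, sample)

# Read the raw bytes back and decode them the way a UTF-8 reader would.
with open(out_path, "rb") as raw:
    data = raw.read()
round_trip = data.decode("utf-8")
```

If the bytes on disk are valid UTF-8, `Jsoup.parse( file, "UTF-8" )` should be able to decode them. If '?' still shows up, the output side (for example the console encoding used when printing from Java) may be worth checking as well.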