Characters are not decoded correctly with Jsoup and PhantomJS

Time: 2016-04-06 14:51:21

Tags: python selenium character-encoding phantomjs jsoup

Here is the problem: I am using PhantomJS and Selenium in Python to render pages. This is the code:

import sys, time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

path_to_chromedriver = 'C:\\..\\chromedriver'

section = sys.argv[1]
path = sys.argv[2]
links = sys.argv[3]

# Read one link per line, dropping trailing newlines and blank lines.
with open(links, 'r') as links_file:
    listOfLinks = [line.strip() for line in links_file if line.strip()]

dr = webdriver.Chrome(executable_path = path_to_chromedriver)

cont = 0
for link in listOfLinks:
    try:
        dr.get(link)

        # Wait.
        element = WebDriverWait(dr, 20).until(
            EC.presence_of_element_located((By.CLASS_NAME, "_img-zoom"))
        )

        time.sleep(1)

        htmlPath = path + section + "_" + str(cont) + ".html"

        # Write HTML.
        file = open(htmlPath, 'w')
        file.write(dr.page_source)
        file.close()

        cont = cont + 1
    except:
        print("Exception")

dr.quit()

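This code saves the rendered HTML of each link listed in the file passed as an argument.

The saved files are then parsed with Jsoup in Java: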

Document document = Jsoup.parse( file, "UTF-8" );

However, special characters such as '€', 'á', 'é' and 'í' are not decoded correctly and are replaced by '?'. How can I fix this?
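
One way this symptom can arise is sketched below. It assumes the HTML files were written in a Windows default encoding such as cp1252 rather than UTF-8; the question does not say which platform or Python version is used, so this is an illustration and not a diagnosis. The sample string is made up for the demonstration.

```python
# Sample text containing the characters from the question: € á é í
text = u"\u20ac\u00e1\u00e9\u00ed"

# Assumption: the file on disk was encoded as cp1252, not UTF-8.
raw = text.encode("cp1252")

# Reading those bytes as UTF-8 fails; with errors="replace" the invalid
# bytes become U+FFFD, which is often displayed as '?'.
as_utf8 = raw.decode("utf-8", errors="replace")

# ISO-8859-1 recovers the accented letters, but byte 0x80 maps to the
# control character U+0080 instead of the euro sign.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the encoding that was actually used recovers everything.
as_cp1252 = raw.decode("cp1252")
```

If this assumption holds, reading the file as ISO-8859-1 would fix the accented letters while possibly leaving '€' wrong, and a windows-1252 charset name may be the closer match on the reading side.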

1 Answer:

Answer 0: (score: 0)

Uzochi

Solution found:

Try Document document = Jsoup.parse(file, "ISO-8859-1");
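
An alternative is to fix the encoding on the writing side, so that the `"UTF-8"` argument in the question's `Jsoup.parse` call matches the bytes on disk. The sketch below is not part of the original answer. It assumes the page source is a text (Unicode) string; `save_html` is a hypothetical helper name, and the sample string and temporary path stand in for `dr.page_source` and `htmlPath`.

```python
import io
import os
import tempfile


def save_html(html_path, page_source):
    """Write page_source (a text string) to html_path encoded as UTF-8."""
    # An explicit encoding avoids depending on the platform default.
    with io.open(html_path, "w", encoding="utf-8") as out:
        out.write(page_source)


# Stand-in for dr.page_source, containing the characters from the question.
sample = u"<p>\u20ac \u00e1 \u00e9 \u00ed</p>"
html_path = os.path.join(tempfile.mkdtemp(), "section_0.html")
save_html(html_path, sample)

# Read the raw bytes back to check what was actually written.
with io.open(html_path, "rb") as f:
    written_bytes = f.read()
```

In the question's loop, this would replace the `open(htmlPath, 'w')` / `write` / `close` lines with a call such as `save_html(htmlPath, dr.page_source)`.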