Question

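I am new to Selenium and am trying it out on a few websites for testing purposes. I ran into a case where Tamil and Hindi text is scraped as "??????".

I tried opening the output in Notepad++, Sublime Text and Excel, but it still shows as "??????".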

Xpath tried - //h1//following::p[@id='topDescription']

Test URLs
"https://www.hooq.tv/catalog/7a6d593d-e8f3-47b6-92ae-469b8e08178e?__sr=feed"
"https://www.hooq.tv/catalog/d023630f-882b-4df4-8cb5-857ebfff20b4?__sr=feed"

Code

d.get("https://www.hooq.tv/catalog/7a6d593d-e8f3-47b6-92ae-469b8e08178e?__sr=feed");
d.findElement(By.xpath("//h1//following::p[@id='topDescription']")).getText();

Is this an encoding issue?

Answer 1

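First, make sure you are getting the original text correctly before you save it to an external file.

I tested .getText() on your element in Java, and it returns the String as is.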
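One way to check this without relying on the console's own encoding is to print the Unicode code points of the string. The sketch below is only an illustration, not part of the original test: `CodePointDump` and `toCodePoints` are hypothetical names, and the sample string stands in for the value returned by `.getText()`. Code points in the Tamil (U+0B80 to U+0BFF) or Devanagari (U+0900 to U+097F) ranges mean the text was retrieved intact; a run of U+003F would mean it had already been replaced by question marks before any file was written.

```java
import java.util.stream.Collectors;

class CodePointDump {
    // Renders every code point as U+XXXX, so the result is plain ASCII
    // and does not depend on the console or editor encoding.
    static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Hypothetical sample (the Tamil word for "Tamil"), standing in for
        // the String returned by getText() on the scraped element.
        String text = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD";
        System.out.println(toCodePoints(text));
        // prints: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD
    }
}
```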

Next, you need to make sure the file is written with UTF-8 character encoding.

Here is an example using org.apache.commons.io.FileUtils:

FileUtils.write(new File("C:/temp/test.txt"), str, "UTF-8");
FileUtils.write(new File("C:/temp/test.csv"), str, "UTF-8");
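If you would rather not depend on Commons IO, the same write can be sketched with the standard library alone. This is a minimal sketch under a few assumptions: `Utf8FileWriter`, `writeUtf8` and `roundTrip` are hypothetical names, the sample string stands in for the scraped text, and the optional byte order mark is there only because Excel often does not recognise a CSV as UTF-8 without one (whether you need it depends on your Excel version).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FileWriter {
    // Writes the text as UTF-8 bytes. When withBom is true, a byte order
    // mark (U+FEFF) is placed in front of the text.
    static void writeUtf8(Path target, String text, boolean withBom) {
        String payload = withBom ? "\uFEFF" + text : text;
        try {
            Files.write(target, payload.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes to a temporary file and reads it back as UTF-8, to confirm
    // that nothing is lost on the way to disk.
    static String roundTrip(String text, boolean withBom) {
        try {
            Path tmp = Files.createTempFile("scraped", ".txt");
            try {
                writeUtf8(tmp, text, withBom);
                return new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample (the Hindi word for "Hindi").
        String str = "\u0939\u093F\u0928\u094D\u0926\u0940";
        System.out.println(roundTrip(str, false).equals(str)); // prints: true
    }
}
```

When you open the resulting file in Notepad++ or Sublime Text, also check that the editor is reading it as UTF-8, since a correctly written file can still look wrong if it is opened with a different encoding.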

Hope it helps.

Unable to scrape non-English fonts - Selenium
