Character encoding errors when using OpenCalais on French documents

Time: 2017-11-21 15:35:24

Tags: java utf-8 character-encoding opencalais

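I am trying to make a simple call to the OpenCalais API to tag entities in a raw document written in French (so it contains many accented characters). In the returned response, all accented characters are converted into strange symbols.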

I have read the API documentation, I set the "Content-Type" header to "text/raw;charset=utf-8", and I have checked that the text is definitely encoded in UTF-8.
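For reference, one way to check that a byte sequence really is well-formed UTF-8 is to decode it strictly. The following is only a minimal sketch using the JDK; the class name `Utf8Check` and the helper `isValidUtf8` are hypothetical names chosen for illustration, not part of OpenCalais or of the code in this question.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
public class Utf8Check {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café", written with an escape so the source file stays ASCII.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false
    }
}
```

To check a file, the bytes could be obtained with `Files.readAllBytes(Paths.get(filename))` and passed to the same helper.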

This is the code I use to read the content from the file:

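import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Reads the whole file as UTF-8, keeping a line break between lines.
public static String readInput(String filename) {
    Path file = Paths.get(filename);
    StringBuilder content = new StringBuilder();
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so add it back;
            // otherwise words on adjacent lines are glued together
            content.append(line).append('\n');
        }
    } catch (IOException x) {
        System.err.format("IOException: %s%n", x);
    }

    return content.toString();
}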

Before sending the request, I printed the string read from the file. It shows the original text with no encoding errors.
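Printing to the console depends on the console's own encoding, so a check that does not rely on it can be more convincing. The sketch below is only an illustration (the class `EscapeNonAscii` and the method `escape` are hypothetical names): it writes every non-ASCII code point as `<U+XXXX>`, so a correctly read "é" shows up as `<U+00E9>`, while a mis-decoded one typically shows up as two code points such as `<U+00C3><U+00A9>`.

```java
// Hypothetical helper class, for illustration only.
public class EscapeNonAscii {

    // Renders every non-ASCII code point as <U+XXXX> so the result is
    // readable even if the console cannot display accented characters.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(String.format("<U+%04X>", cp));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // "été" as read correctly from a UTF-8 file
        System.out.println(escape("\u00e9t\u00e9"));
        // the same word after its UTF-8 bytes were decoded as ISO-8859-1
        System.out.println(escape("\u00c3\u00a9t\u00c3\u00a9"));
    }
}
```

The same helper can be applied to the response text to see exactly which code points came back from the API.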

Here is the code I use to make the request and get the response from the OpenCalais API:

    // make call to the API link
    DefaultHttpClient httpClient = new DefaultHttpClient();
    HttpPost postRq = new HttpPost(url);

    // add necessary headers (custom)
    postRq.addHeader("x-ag-access-token", tokenKey);
    postRq.addHeader("x-calais-language", lang);
    postRq.addHeader("outputFormat", outputFormat);

    // add necessary headers (fixed)
    postRq.addHeader("Content-Type", "text/raw;charset=utf-8");
    postRq.addHeader("x-calais-contentClass", "news");
    postRq.addHeader("Accepted-Charset", "utf-8");

    // pass body content in the call
    StringEntity entityInput = null;
    try {
        entityInput = new StringEntity(text);
        postRq.setEntity(entityInput);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    // execute the call
    HttpResponse response = null;
    try {
        response = httpClient.execute(postRq);
    } catch (IOException e) {
        e.printStackTrace();
    }

    if (response.getStatusLine().getStatusCode() != 200) {
        throw new RuntimeException("Failed : HTTP error code : "
                + response.getStatusLine().getStatusCode());
    }

    // read the response
    String output, result = "";
    BufferedReader br = null;

    try {
        br = new BufferedReader(
                new InputStreamReader(response.getEntity().getContent(), "UTF-8"));
        while ((output = br.readLine()) != null) {
            result += output + "\n";
            System.out.println(output); // !!! the returned text has strange symbols
        }
        br.close();
    } catch (UnsupportedOperationException | IOException e) {
        e.printStackTrace();
    }

    // close the connection
    httpClient.getConnectionManager().shutdown();
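One thing worth checking in the request code above: in Apache HttpClient 4.x, the single-argument `new StringEntity(text)` constructor encodes the body as ISO-8859-1 by default, regardless of the "Content-Type" header that is added separately. If that is what happens here, the bytes sent would be Latin-1 while the header announces UTF-8. Passing the charset explicitly, for example `new StringEntity(text, "UTF-8")`, would be the obvious thing to try. Note also that the standard request header is "Accept-Charset", not "Accepted-Charset". The sketch below does not call OpenCalais or HttpClient; it only reproduces, with plain JDK classes, the effect of such a mismatch. The class `CharsetMismatchDemo` and the method `roundTrip` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
public class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens when sender and receiver disagree.
    static String roundTrip(String text, Charset sentAs, Charset readAs) {
        return new String(text.getBytes(sentAs), readAs);
    }

    public static void main(String[] args) {
        // "école française", written with escapes so the source file stays ASCII
        String text = "\u00e9cole fran\u00e7aise";

        // Sender and receiver agree: the text survives unchanged.
        String ok = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(text)); // true

        // Body encoded as ISO-8859-1 but read as UTF-8: the accented
        // characters are malformed input and become U+FFFD.
        String broken = roundTrip(text, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(broken.equals(text)); // false
    }
}
```

If the explicit charset on the entity does not change the result, the remaining suspects would be the encoding the server uses for its response and the encoding of the console that prints it.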

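Here are a few things I have tried (without success):

  • Retyping the whole text by hand (no copy-paste),
  • Copying the text into Sublime Text, correcting every accented character (I deleted the accented characters and retyped them to make sure no encoding conflict slipped in through copy-paste), and saving the file with UTF-8 encoding.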

Could you tell me how to fix this? Thanks!

PS: I have also posted my question on the OpenCalais forum on their website, but have not received a solution yet.

0 answers:

No answers yet