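Escaped characters in the Microsoft Translator API: how can I keep escaped entities from being mistranslated?

Time: 2019-04-12 14:39:26

Tags: php escaping microsoft-translator bing-translator-api

I am using the Microsoft Translator API to translate scientific articles into different languages. However, when I translate a sentence that contains escaped Greek characters and superscript text, the result no longer matches the original meaning. For example, this is the original HTML text: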

<p>DNA polymerases (Pols) &#945;, &#949;, and &#948;<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup></p>. 

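The HTML entities are alpha, epsilon, and delta, and the sup tags containing 1 and 2 are the references for this sentence.

The code I use to translate the sentence comes from the MicrosoftTranslator/Text-Translation-API-V3-PHP sample.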

<?php

// NOTE: Be sure to uncomment the following line in your php.ini file.
// ;extension=php_openssl.dll

// **********************************************
// *** Update or verify the following values. ***
// **********************************************

// Replace the subscriptionKey string value with your valid subscription key.
$key = 'ENTER YOUR KEY';

$host = "https://api.cognitive.microsofttranslator.com";
$path = "/translate?api-version=3.0";

// Translate to German and treat the input as HTML.
$params = "&to=de&textType=html";

$text = '<p>DNA polymerases (Pols) &#945;, &#949;, and &#948;<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';

if (!function_exists('com_create_guid')) {
  function com_create_guid() {
    return sprintf( '%04x%04x-%04x-%04x-%04x-%04x%04x%04x',
        mt_rand( 0, 0xffff ), mt_rand( 0, 0xffff ),
        mt_rand( 0, 0xffff ),
        mt_rand( 0, 0x0fff ) | 0x4000,
        mt_rand( 0, 0x3fff ) | 0x8000,
        mt_rand( 0, 0xffff ), mt_rand( 0, 0xffff ), mt_rand( 0, 0xffff )
    );
  }
}

function Translate ($host, $path, $key, $params, $content) {

    $headers = "Content-type: application/json\r\n" .
        "Content-length: " . strlen($content) . "\r\n" .
        "Ocp-Apim-Subscription-Key: $key\r\n" .
        "X-ClientTraceId: " . com_create_guid() . "\r\n";

    // NOTE: Use the key 'http' even if you are making an HTTPS request. See:
    // http://php.net/manual/en/function.stream-context-create.php
    $options = array (
        'http' => array (
            'header' => $headers,
            'method' => 'POST',
            'content' => $content
        )
    );
    $context  = stream_context_create ($options);
    $result = file_get_contents ($host . $path . $params, false, $context);
    return $result;
}

$requestBody = array (
    array (
        'Text' => $text,
    ),
);
$content = json_encode($requestBody, JSON_UNESCAPED_UNICODE);

$result = Translate ($host, $path, $key, $params, $content);

// Note: We convert result, which is JSON, to and from an object so we can pretty-print it.
// We want to avoid escaping any Unicode characters that result contains. See:
// http://php.net/manual/en/function.json-encode.php
$json = json_encode(json_decode($result), JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);
echo $json;
?>

I set the parameter 'textType=html' so that the HTML tags themselves are not translated.

This is what I get:

[
    {
        "detectedLanguage": {
            "language": "en",
            "score": 0.77
        },
        "translations": [
            {
                "text": "<p>DNA-Polymerasen (Pols) α,-, und<sup class=\"xref\"><\/sup>,-1<sup>,<\/sup><sup class=\"xref\">2<\/sup>.",
                "to": "de"
            }
        ]
    }
]

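After translation, the polymerase names are lost (only α survives), and the text inside the superscript tags is changed as well.

Then I tried wrapping the HTML entities in class="notranslate", as mentioned in the documentation.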

$text = '<p>DNA polymerases (Pols) <span class="notranslate">&#945;</span>, <span class="notranslate">&#949;</span>, and <span class="notranslate">&#948;</span><sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';
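The wrapping above can also be generated instead of typed by hand. The following is a minimal sketch, assuming that every numeric character reference in the input should be protected. The helper name wrapEntitiesNoTranslate is made up for illustration and is not part of the Microsoft sample.

```php
<?php
// Hypothetical helper (not from the Microsoft sample): wraps every decimal
// numeric character reference such as &#945; in a notranslate span.
function wrapEntitiesNoTranslate($html) {
    // $0 is the full matched entity, e.g. "&#945;".
    return preg_replace('/&#\d+;/', '<span class="notranslate">$0</span>', $html);
}

$original = '<p>DNA polymerases (Pols) &#945;, &#949;, and &#948;<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';

// Same value as the hand-written $text above.
$text = wrapEntitiesNoTranslate($original);
echo $text;
```

For the sample sentence this yields the same string as the hand-written $text.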

The result after translation is:

[
    {
        "detectedLanguage": {
            "language": "en",
            "score": 1
        },
        "translations": [
            {
                "text": "<p>DNA-Polymerasen (Pols <span class=\"notranslate\"> &#945; <\/span>) <span class=\"notranslate\"> &#949; <\/span>, und <span class=\"notranslate\"> &#948; <\/span> <sup class=\"xref\">1,2<\/sup><sup><\/sup><sup class=\"xref\"><\/sup>.",
                "to": "de"
            }
        ]
    }
]

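This also changes the meaning of the sentence: the closing parenthesis now follows α, and both reference numbers are merged into the first sup tag.

I wondered whether the HTML entities were being altered during encoding and decoding, but when I translate the same text into Spanish it comes out fine, without any "notranslate" wrapping.

Has anyone run into the same problem during translation? It actually worked well before I migrated to API version 3, but the old version will be deprecated after April 2019.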
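One way to test the encoding hypothesis would be to send the Greek letters as literal UTF-8 characters instead of numeric entities. The sketch below uses only PHP built-ins (preg_replace_callback and html_entity_decode); the function name decodeNonAsciiEntities is an illustrative assumption, and whether this changes the German output has not been verified against the service.

```php
<?php
// Hypothetical helper: turns decimal numeric character references above
// ASCII (e.g. &#945;) into UTF-8 characters. ASCII references such as
// &#60; are left alone so the markup cannot be broken by decoding.
function decodeNonAsciiEntities($html) {
    return preg_replace_callback('/&#(\d+);/', function ($m) {
        if ((int) $m[1] <= 127) {
            return $m[0]; // keep ASCII references encoded
        }
        return html_entity_decode($m[0], ENT_NOQUOTES | ENT_HTML5, 'UTF-8');
    }, $html);
}

$text = '<p>DNA polymerases (Pols) &#945;, &#949;, and &#948;<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';
$decoded = decodeNonAsciiEntities($text);

// JSON_UNESCAPED_UNICODE keeps the letters as raw UTF-8 in the request body,
// matching how the sample script encodes its payload.
$content = json_encode(array(array('Text' => $decoded)), JSON_UNESCAPED_UNICODE);
echo $decoded;
```

If the literal characters translate correctly where the entities do not, the problem is in how the service handles numeric references rather than in the PHP encoding step.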

0 answers:

No answers yet