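I am using the Microsoft Translator API to translate scientific articles into different languages. However, when I translate a sentence that contains escaped Greek characters and superscript text, the result no longer means the same thing. For example, here is the original HTML text: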
<p>DNA polymerases (Pols) α, ε, and δ<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup></p>.
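The HTML entities are alpha, epsilon, and delta, and the sup tags containing 1 and 2 are the references for this sentence.
The code I use to translate the sentence comes from the MicrosoftTranslator/Text-Translation-API-V3-PHP sample.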
<?php
// NOTE: Be sure to uncomment the following line in your php.ini file.
// ;extension=php_openssl.dll
// **********************************************
// *** Update or verify the following values. ***
// **********************************************
// Replace the subscriptionKey string value with your valid subscription key.
$key = 'ENTER YOUR KEY';
$host = "https://api.cognitive.microsofttranslator.com";
$path = "/translate?api-version=3.0";
// Translate to German and treat the input as HTML.
$params = "&to=de&textType=html";
$text = '<p>DNA polymerases (Pols) α, ε, and δ<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';
if (!function_exists('com_create_guid')) {
    function com_create_guid() {
        return sprintf( '%04x%04x-%04x-%04x-%04x-%04x%04x%04x',
            mt_rand( 0, 0xffff ), mt_rand( 0, 0xffff ),
            mt_rand( 0, 0xffff ),
            mt_rand( 0, 0x0fff ) | 0x4000,
            mt_rand( 0, 0x3fff ) | 0x8000,
            mt_rand( 0, 0xffff ), mt_rand( 0, 0xffff ), mt_rand( 0, 0xffff )
        );
    }
}
function Translate ($host, $path, $key, $params, $content) {
    $headers = "Content-type: application/json\r\n" .
        "Content-length: " . strlen($content) . "\r\n" .
        "Ocp-Apim-Subscription-Key: $key\r\n" .
        "X-ClientTraceId: " . com_create_guid() . "\r\n";
    // NOTE: Use the key 'http' even if you are making an HTTPS request. See:
    // http://php.net/manual/en/function.stream-context-create.php
    $options = array (
        'http' => array (
            'header' => $headers,
            'method' => 'POST',
            'content' => $content
        )
    );
    $context = stream_context_create ($options);
    $result = file_get_contents ($host . $path . $params, false, $context);
    return $result;
}
$requestBody = array (
    array (
        'Text' => $text,
    ),
);
$content = json_encode($requestBody, JSON_UNESCAPED_UNICODE);
$result = Translate ($host, $path, $key, $params, $content);
// Note: We convert result, which is JSON, to and from an object so we can pretty-print it.
// We want to avoid escaping any Unicode characters that result contains. See:
// http://php.net/manual/en/function.json-encode.php
$json = json_encode(json_decode($result), JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);
echo $json;
?>
I set the parameter textType=html so that the HTML tags themselves are not translated.
This is what I got:
[
    {
        "detectedLanguage": {
            "language": "en",
            "score": 0.77
        },
        "translations": [
            {
                "text": "<p>DNA-Polymerasen (Pols) α,-, und<sup class=\"xref\"><\/sup>,-1<sup>,<\/sup><sup class=\"xref\">2<\/sup>.",
                "to": "de"
            }
        ]
    }
]
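After translation, the names of the polymerases are lost and the text inside the superscript tags has changed.
Then I tried wrapping the HTML entities in class="notranslate", as mentioned in the documentation.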
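To make this kind of damage easy to spot, the superscript contents of the source and of the translation can be compared. The following is only a minimal sketch: `extractSupTexts` is a hypothetical helper, not part of the Microsoft sample, and the two strings are the input and the JSON-decoded output shown above.

```php
<?php
// Hypothetical helper (not from the Microsoft sample): returns the text
// content of every <sup> element in the order it appears.
function extractSupTexts(string $html): array {
    preg_match_all('~<sup\b[^>]*>(.*?)</sup>~is', $html, $matches);
    return array_map('trim', $matches[1]);
}

// Source sentence and the German result, after json_decode has turned
// the escaped "<\/sup>" back into "</sup>".
$source = '<p>DNA polymerases (Pols) α, ε, and δ<sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';
$translated = '<p>DNA-Polymerasen (Pols) α,-, und<sup class="xref"></sup>,-1<sup>,</sup><sup class="xref">2</sup>.';

$sourceSups = extractSupTexts($source);
$translatedSups = extractSupTexts($translated);

// A mismatch means the translation moved or dropped reference markers.
if ($sourceSups !== $translatedSups) {
    echo "Superscript content changed during translation\n";
}
```

A regular expression is enough for flat markup like this; nested or malformed HTML would need a real parser such as DOMDocument.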
$text = '<p>DNA polymerases (Pols) <span class="notranslate">α</span>, <span class="notranslate">ε</span>, and <span class="notranslate">δ</span><sup class="xref">1</sup><sup>,</sup><sup class="xref">2</sup>. ';
The result after translation is:
[
    {
        "detectedLanguage": {
            "language": "en",
            "score": 1
        },
        "translations": [
            {
                "text": "<p>DNA-Polymerasen (Pols <span class=\"notranslate\"> α <\/span>) <span class=\"notranslate\"> ε <\/span>, und <span class=\"notranslate\"> δ <\/span> <sup class=\"xref\">1,2<\/sup><sup><\/sup><sup class=\"xref\"><\/sup>.",
                "to": "de"
            }
        ]
    }
]
This also changes the meaning of the sentence.
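For repeated tests, the notranslate wrapping can be applied automatically instead of by hand. This is a sketch under assumptions: `wrapEntitiesNoTranslate` is a hypothetical name, and the input is assumed to contain the Greek letters in escaped form (for example `&alpha;`). It only reproduces the wrapping step and is not a fix for the reordering shown above.

```php
<?php
// Hypothetical helper: wraps every named or numeric HTML entity in a
// <span class="notranslate"> element.
// Limitations: it also wraps entities such as &amp;, and it does not skip
// entities that occur inside attribute values.
function wrapEntitiesNoTranslate(string $html): string {
    return preg_replace(
        '/&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);/',
        '<span class="notranslate">$0</span>',
        $html
    );
}

$wrapped = wrapEntitiesNoTranslate('(Pols) &alpha;, &epsilon;, and &delta;');
echo $wrapped, "\n";
```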
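I wondered whether the HTML entities were being altered during encoding and decoding, but when I translate the same sentence into Spanish it comes out fine, without any notranslate wrapping.
Has anyone run into the same problem when translating? It actually worked well before I migrated to API version 3, but the old version will be deprecated after April 2019.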
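One way to rule out the local encoding step is to round-trip the request body through `json_encode` and `json_decode` without calling the API at all. The sketch below assumes the text uses escaped entities such as `&alpha;`; the variable names are illustrative only.

```php
<?php
// Build the request body exactly as the sample does, then decode it again.
$text = '<p>DNA polymerases (Pols) &alpha;, &epsilon;, and &delta;<sup class="xref">1</sup></p>';
$content = json_encode(array(array('Text' => $text)), JSON_UNESCAPED_UNICODE);
$decoded = json_decode($content, true);

// json_encode leaves "&" untouched by default, so entities go out unchanged.
$entitiesIntact = strpos($content, '&alpha;') !== false;

// The "<\/sup>" seen in the API output is ordinary JSON slash escaping;
// json_decode turns it back into "</sup>".
$slashEscaped = strpos($content, '<\/sup>') !== false;

// True when the text survives the local round trip byte for byte.
$roundTripOk = ($decoded[0]['Text'] === $text);

var_dump($entitiesIntact, $slashEscaped, $roundTripOk);
```

If all three values are true, any change to the entities happens on the service side rather than in the PHP encoding and decoding.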