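I'm writing a server API for an iOS application. As part of its initialization, the app should send the phone's interface language to the server through an API call.
The problem is that Apple uses IETF BCP 47 language identifiers in its NSLocale preferredLanguages function.
The returned values have different lengths (e.g. [aa, ab, ace, ach, ada, ady, ae, af, afa, afh, agq, ...]), and I have found very few parsers that can convert these codes into a proper language identifier.
I would like to use the more common ISO-639-2 three-letter language identifier, which is ubiquitous, has parsers in many programming languages, and gives a standard 3-letter representation of each language.
How can I convert an IETF BCP 47 language identifier to an ISO-639-2 three-letter language identifier, preferably in Python?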
Answer 0 (score: 7)
A BCP 47 identifier starts with a 2-letter ISO 639-1 code or a 3-letter ISO 639-2, 639-3 or 639-5 language code; see the RFC 5646 Syntax section:
    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag
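As a rough illustration of the langtag production, here is a minimal sketch that pulls the language, script and region subtags out of a tag with a regular expression. The names `_LANGTAG` and `split_langtag` are made up for this example, and the pattern is deliberately simplified: it ignores extlang, variant, extension and private-use subtags and rejects anything that does not start with a 2- or 3-letter language subtag.

```python
import re

# Simplified view of the RFC 5646 'langtag' production (an assumption of this
# sketch): language, then an optional 4-letter script, then an optional region
# (2 letters or 3 digits). Everything after that is ignored.
_LANGTAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?=-|$)"  # each matched subtag must end at a '-' or at the end of the tag
)


def split_langtag(tag):
    """Return a dict with 'language', 'script' and 'region' (None if absent)."""
    match = _LANGTAG.match(tag)
    if match is None:
        # e.g. private-use ('x-...') or 'i-' prefixed grandfathered tags
        raise ValueError(tag)
    return match.groupdict()


print(split_langtag("zh-Hans-CN"))
print(split_langtag("en-GB"))
```

For the conversion asked about in the question, only the `language` entry matters; the script and region are shown just to make the structure of the tag visible.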
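I don't expect Apple to use the privateuse or grandfathered forms, so you can assume that you are looking at ISO 639-1, ISO 639-2, ISO 639-3 or ISO 639-5 language codes here. Simply map the 2-letter ISO 639-1 codes to 3-letter ISO 639-* codes.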
You can use the pycountry package:
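    import pycountry

    # two_letter_code is the primary language subtag of the tag, e.g. 'aa'.
    # Newer pycountry releases rename these fields: alpha_2=... and lang.alpha_3.
    lang = pycountry.languages.get(alpha2=two_letter_code)
    three_letter_code = lang.terminology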
Demo:
>>> import pycountry
>>> lang = pycountry.languages.get(alpha2='aa')
>>> lang.terminology
u'aar'
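The terminology form is the preferred 3-letter code; there is also a bibliographic form, which differs for only 22 entries. See ISO 639-2 B and T codes. The package doesn't include entries from ISO 639-5; that list overlaps and conflicts with 639-2 in places, and I don't think Apple uses such codes at all.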
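If the consumer of your API happens to expect the bibliographic (B) codes instead, the difference is small enough to handle with a lookup table. The sketch below lists only a handful of the languages whose B and T codes differ; `T_TO_B` and `to_bibliographic` are names invented for this example, and a real table would need all of the differing entries.

```python
# A few ISO 639-2 languages whose bibliographic (B) code differs from the
# terminology (T) code. This is an illustrative subset, not the full list.
T_TO_B = {
    "deu": "ger",  # German
    "fra": "fre",  # French
    "zho": "chi",  # Chinese
    "nld": "dut",  # Dutch
    "ell": "gre",  # Modern Greek
    "ces": "cze",  # Czech
}


def to_bibliographic(terminology_code):
    """Return the B code for a T code; codes not in the table pass through."""
    code = terminology_code.lower()
    return T_TO_B.get(code, code)


print(to_bibliographic("deu"))  # ger
print(to_bibliographic("aar"))  # aar (B and T codes are identical here)
```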
Answer 1 (score: 1)
    Language-Tag  = langtag             ; normal language tags
                  / privateuse          ; private use tag
                  / grandfathered       ; grandfathered tags

    langtag       = language
                    ["-" script]
                    ["-" region]
                    *("-" variant)
                    *("-" extension)
                    ["-" privateuse]

    language      = 2*3ALPHA            ; shortest ISO 639 code
                    ["-" extlang]       ; sometimes followed by
                                        ; extended language subtags
                  / 4ALPHA              ; or reserved for future use
                  / 5*8ALPHA            ; or registered language subtag

    privateuse    = "x" 1*("-" (1*8alphanum))

    grandfathered = irregular           ; non-redundant tags registered
                  / regular             ; during the RFC 3066 era
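It looks like the first segment of most BCP-47 codes should be a valid ISO-639 code, though it may not be the three-letter variant. Some BCP-47 language codes are not ISO-639 codes at all, namely those starting with x- or i-, as well as a number of legacy codes that match the grandfathered part of the syntax: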
    irregular     = "en-GB-oed"         ; irregular tags do not match
                  / "sgn-BE-FR"         ; also includes i- prefixed codes
                  / "sgn-BE-NL"
                  / "sgn-CH-DE"

    regular       = "art-lojban"        ; these tags match the 'langtag'
                  / "cel-gaulish"       ; production, but their subtags
                  / "no-bok"            ; are not extended language
                  / "no-nyn"            ; or variant subtags: their meaning
                  / "zh-guoyu"          ; is defined by their registration
                  / "zh-hakka"          ; and all of these are deprecated
                  / "zh-min"            ; in favor of a more modern
                  / "zh-min-nan"        ; subtag or sequence of subtags
                  / "zh-xiang"
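Before trusting the first segment of a tag, it may be worth filtering out these special forms. Here is a minimal sketch that classifies a tag using only the tags shown in the excerpt above plus the x- and i- prefixes; the full RFC 5646 list of irregular tags is longer, and `IRREGULAR`, `REGULAR` and `classify_tag` are names I made up for the example.

```python
# Tags copied from the excerpt above, lowercased because BCP 47 tags are
# case-insensitive. The complete RFC 5646 list contains more entries.
IRREGULAR = {"en-gb-oed", "sgn-be-fr", "sgn-be-nl", "sgn-ch-de"}
REGULAR = {
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn", "zh-guoyu",
    "zh-hakka", "zh-min", "zh-min-nan", "zh-xiang",
}


def classify_tag(tag):
    """Return 'privateuse', 'grandfathered' or 'langtag' for a BCP 47 tag."""
    lowered = tag.lower()
    if lowered.startswith("x-"):
        return "privateuse"
    if lowered.startswith("i-") or lowered in IRREGULAR or lowered in REGULAR:
        return "grandfathered"
    return "langtag"


print(classify_tag("en-GB"))       # langtag
print(classify_tag("zh-min-nan"))  # grandfathered
print(classify_tag("x-custom"))    # privateuse
```

Only tags classified as `langtag` are safe to hand to a function that reads the first segment as an ISO 639 code.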
A good start would look something like this:
    def extract_iso_code(bcp_identifier):
        # partition() also copes with tags that contain no '-' at all, e.g. 'en'
        language, _, _ = bcp_identifier.partition('-')
        if 2 <= len(language) <= 3:
            # this is a valid ISO-639 code or is grandfathered
            return language.lower()
        else:
            # handle non-ISO codes
            raise ValueError(bcp_identifier)
Converting from the 2-character variant to the 3-character variant should be easy to handle, since the mapping is well known.
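To make that last step concrete, here is a minimal sketch of the conversion with a hand-written table. The table holds only a few sample entries and would have to be filled in from the ISO 639 code lists (or replaced by a library lookup); `ALPHA2_TO_ALPHA3` and `to_alpha3` are hypothetical names used only in this example.

```python
# Sample ISO 639-1 -> ISO 639-2 (terminology) entries. A real table needs
# every 2-letter code; these few are for illustration only.
ALPHA2_TO_ALPHA3 = {
    "aa": "aar",
    "ab": "abk",
    "ae": "ave",
    "af": "afr",
    "de": "deu",
    "en": "eng",
    "fr": "fra",
    "zh": "zho",
}


def to_alpha3(bcp_identifier):
    """Map the first segment of a BCP 47 tag to a 3-letter code."""
    language = bcp_identifier.partition("-")[0].lower()
    if len(language) == 3:
        # Already 3 letters. Note that this may be an ISO 639-3 or 639-5
        # code that has no ISO 639-2 entry; the sketch does not check that.
        return language
    if len(language) == 2 and language in ALPHA2_TO_ALPHA3:
        return ALPHA2_TO_ALPHA3[language]
    # Unknown 2-letter code, private-use tag, 'i-' tag, and so on.
    raise ValueError(bcp_identifier)


print(to_alpha3("en-GB"))    # eng
print(to_alpha3("zh-Hans"))  # zho
print(to_alpha3("ace"))      # ace
```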