How do I convert an IETF BCP 47 language identifier to ISO-639-2?

Time: 2014-09-28 13:53:16

Tags: python ios iso-639-2 ietf-bcp-47

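I am writing a server API for an iOS application. As part of the initialization process, the app should send the phone's interface language to the server via an API call.

The problem is that Apple uses something called an IETF BCP 47 language identifier in its NSLocale preferredLanguages function.

The returned values have different lengths (for example [aa, ab, ace, ach, ada, ady, ae, af, afa, afh, agq, ...]), and I have found very few parsers that can convert these codes into a proper language identifier.

I would like to use the more common ISO-639-2 three-letter language identifier, which is ubiquitous, has parsers in many languages, and gives a standard 3-letter representation of each language.

How can I convert an IETF BCP 47 language identifier to an ISO-639-2 three-letter language identifier, preferably in Python?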

2 answers:

Answer 0 (score: 7)

BCP 47 identifiers start with a 2-letter ISO 639-1 code or a 3-letter ISO 639-2, 639-3 or 639-5 language code; see the RFC 5646 Syntax section:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag

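I don't expect Apple to use the privateuse or grandfathered forms, so you can assume that you are looking at ISO 639-1, ISO 639-2, ISO 639-3 or ISO 639-5 language codes here. Simply map the 2-letter ISO 639-1 codes to 3-letter ISO 639-* codes.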
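To illustrate the grammar above, here is a minimal sketch of pulling the primary language subtag (plus an optional script and region) out of a tag. The helper name split_langtag and the simplified pattern are assumptions made for this example, not something defined by the RFC or by any library; extlang, variant, extension and private-use subtags are deliberately ignored.

```python
import re

# Simplified pattern (an assumption for this sketch): a 2-3 letter primary
# language subtag, then an optional 4-letter script and an optional region
# (2 letters or 3 digits). Everything after that is ignored.
_LANGTAG = re.compile(
    r'^(?P<language>[A-Za-z]{2,3})'
    r'(?:-(?P<script>[A-Za-z]{4}))?'
    r'(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?'
    r'(?=-|$)'
)


def split_langtag(tag):
    """Return (language, script, region) for a simple BCP 47 langtag."""
    match = _LANGTAG.match(tag)
    if match is None:
        # Private-use ('x-...') and 'i-...' tags end up here.
        raise ValueError('not a simple langtag: %r' % (tag,))
    return (match.group('language').lower(),
            match.group('script'),
            match.group('region'))
```

For example, split_langtag('zh-Hans-CN') gives the language 'zh', the script 'Hans' and the region 'CN', while a private-use tag such as 'x-private' raises ValueError.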

You can use the pycountry package:

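import pycountry

# Written against the 2014 pycountry API; newer releases appear to use
# get(alpha_2=...) and the .alpha_3 attribute instead.
two_letter_code = 'aa'  # example input
lang = pycountry.languages.get(alpha2=two_letter_code)
three_letter_code = lang.terminology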

Demo:

>>> import pycountry
>>> lang = pycountry.languages.get(alpha2='aa')
>>> lang.terminology
u'aar'

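The terminology form is the preferred 3-letter code; there is also a bibliographic form, which differs for only 22 entries. See ISO 639-2 B and T codes. The package does not include entries from ISO 639-5; that list overlaps and conflicts with 639-2 in places, and I don't think Apple uses such codes at all.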
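If the codes you receive from elsewhere use the bibliographic form, normalising them to the terminology form is just a small lookup. The sketch below is only an illustration: the table is a hand-picked subset of the B/T pairs, and the names B_TO_T and to_terminology are made up for this example.

```python
# Illustrative subset of ISO 639-2 bibliographic -> terminology pairs.
# This is NOT the complete list; extend it from the official table.
B_TO_T = {
    'chi': 'zho',  # Chinese
    'cze': 'ces',  # Czech
    'dut': 'nld',  # Dutch
    'fre': 'fra',  # French
    'ger': 'deu',  # German
    'gre': 'ell',  # Greek (modern)
}


def to_terminology(code):
    """Map a bibliographic code to its terminology form, if it differs."""
    code = code.lower()
    # Codes that are not in the table are returned unchanged.
    return B_TO_T.get(code, code)
```

Codes that have a single form, such as 'eng', pass through unchanged.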

Answer 1 (score: 1)

From RFC 5646/BCP 47:

Language-Tag  = langtag             ; normal language tags
              / privateuse          ; private use tag
              / grandfathered       ; grandfathered tags

langtag       = language
                ["-" script]
                ["-" region]
                *("-" variant)
                *("-" extension)
                ["-" privateuse]

language      = 2*3ALPHA            ; shortest ISO 639 code
                ["-" extlang]       ; sometimes followed by
                                    ; extended language subtags
              / 4ALPHA              ; or reserved for future use
              / 5*8ALPHA            ; or registered language subtag

privateuse    = "x" 1*("-" (1*8alphanum))

grandfathered = irregular           ; non-redundant tags registered
              / regular             ; during the RFC 3066 era

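It looks like the first segment of most BCP-47 codes should be a valid ISO-639 code, though it may not be the three-letter variant. There are some BCP-47 language codes that are not ISO-639 codes, namely those starting with x- or i-, as well as a number of legacy codes that match the grandfathered part of the grammar: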

irregular     = "en-GB-oed"         ; irregular tags do not match
              / "sgn-BE-FR"         ; also includes i- prefixed codes
              / "sgn-BE-NL"
              / "sgn-CH-DE"

regular       = "art-lojban"        ; these tags match the 'langtag'
              / "cel-gaulish"       ; production, but their subtags
              / "no-bok"            ; are not extended language
              / "no-nyn"            ; or variant subtags: their meaning
              / "zh-guoyu"          ; is defined by their registration
              / "zh-hakka"          ; and all of these are deprecated
              / "zh-min"            ; in favor of a more modern
              / "zh-min-nan"        ; subtag or sequence of subtags
              / "zh-xiang"
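Those special cases can be filtered out before any ISO-639 lookup. The following is a rough sketch, assuming that only the grandfathered tags quoted in the excerpt above (plus anything with an i- prefix) need to be recognised; the names GRANDFATHERED and classify_tag are hypothetical.

```python
# Grandfathered tags quoted in the excerpt above, lower-cased so that the
# comparison is case-insensitive. The 'i-' prefixed tags are handled by a
# prefix check instead of being listed one by one.
GRANDFATHERED = frozenset(tag.lower() for tag in (
    'en-GB-oed', 'sgn-BE-FR', 'sgn-BE-NL', 'sgn-CH-DE',
    'art-lojban', 'cel-gaulish', 'no-bok', 'no-nyn', 'zh-guoyu',
    'zh-hakka', 'zh-min', 'zh-min-nan', 'zh-xiang',
))


def classify_tag(tag):
    """Return 'privateuse', 'grandfathered' or 'langtag' for a BCP 47 tag."""
    lowered = tag.lower()
    if lowered.startswith('x-'):
        return 'privateuse'
    if lowered.startswith('i-') or lowered in GRANDFATHERED:
        return 'grandfathered'
    # Anything else is assumed to follow the normal langtag production.
    return 'langtag'
```

Only tags classified as 'langtag' are then worth passing on to an ISO-639 lookup.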

A good start would look something like this:

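def extract_iso_code(bcp_identifier):
    # partition() also copes with tags that contain no '-' at all, e.g. 'en'
    language, _, _ = bcp_identifier.partition('-')
    if 2 <= len(language) <= 3:
        # this is a valid ISO-639 code or is grandfathered
        return language
    else:
        # handle non-ISO codes (e.g. 'x-...' or 'i-...' tags)
        raise ValueError(bcp_identifier)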

Converting from the 2-character variant to the 3-character variant should be easy to handle, since the mapping is well known.
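As a rough, dependency-free sketch of that last step: the table below is a tiny hand-written subset of the ISO 639-1 to ISO 639-2 (terminology) mapping, and both ISO_639_1_TO_2 and to_iso_639_2 are names invented for this example. A real implementation would load the full table from a library or from the official code list.

```python
# Tiny illustrative subset of the ISO 639-1 -> ISO 639-2/T mapping.
ISO_639_1_TO_2 = {
    'aa': 'aar', 'de': 'deu', 'en': 'eng', 'es': 'spa',
    'fr': 'fra', 'ja': 'jpn', 'zh': 'zho',
}


def to_iso_639_2(bcp_identifier):
    """Return a 3-letter code for the primary subtag of a BCP 47 tag."""
    # Take the primary language subtag; works with or without a '-'.
    language = bcp_identifier.partition('-')[0].lower()
    if len(language) == 3:
        # Already three letters; passed through without validation.
        return language
    if len(language) == 2 and language in ISO_639_1_TO_2:
        return ISO_639_1_TO_2[language]
    # Private-use tags, 'i-' tags and unknown 2-letter codes end up here.
    raise ValueError(bcp_identifier)
```

With this table, to_iso_639_2('en-US') returns 'eng' and to_iso_639_2('ace') returns 'ace' unchanged, while something like 'x-foo' raises ValueError.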