语言标记的正则表达式(由BCP47定义)

时间:2011-08-12 05:11:31

标签: regex bnf

我需要language tag定义的BCP 47的正则表达式。

我知道完整的BNF语法可以在http://www.rfc-editor.org/rfc/bcp/bcp47.txt获得,并且我可以用它来编写我自己的语法,但希望已经有一个。

4 个答案:

答案 0 :(得分:16)

看起来像这样:

^((?<grandfathered>(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|
i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)|(art-lojban|
cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang))|((?<language>
([A-Za-z]{2,3}(-(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})
(-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?(-(?<variant>[A-Za-z0-9]{5,8}
|[0-9][A-Za-z0-9]{3}))*(-(?<extension>[0-9A-WY-Za-wy-z](-[A-Za-z0-9]{2,8})+))*
(-(?<privateUse>x(-[A-Za-z0-9]{1,8})+))?)|(?<privateUse>x(-[A-Za-z0-9]{1,8})+))$

以下是生成它的代码(在C#中):

var regular = "(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang)";
var irregular = "(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)";
var grandfathered = "(?<grandfathered>" + irregular + "|" + regular + ")";
var privateUse = "(?<privateUse>x(-[A-Za-z0-9]{1,8})+)";
var singleton = "[0-9A-WY-Za-wy-z]";
var extension = "(?<extension>" + singleton + "(-[A-Za-z0-9]{2,8})+)";
var variant = "(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})";
var region = "(?<region>[A-Za-z]{2}|[0-9]{3})";
var script = "(?<script>[A-Za-z]{4})";
var extlang = "(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2})";
var language = "(?<language>([A-Za-z]{2,3}(-" + extlang + ")?)|[A-Za-z]{4}|[A-Za-z]{5,8})";
var langtag = "(" + language + "(-" + script + ")?" + "(-" + region + ")?" + "(-" + variant + ")*" + "(-" + extension + ")*" + "(-" + privateUse + ")?" + ")";
var languageTag = @"^(" + grandfathered + "|" + langtag + "|" + privateUse + ")$";

Console.WriteLine(languageTag);

我无法保证其正确性(我可能已经拼写错误),但它在附录A中的示例中工作正常。

根据您的环境,您可能需要删除指定的捕获组"?<...>"

答案 1 :(得分:3)

适用于PHP的优化版本。

/^(?<grandfathered>(?:en-GB-oed|i-(?:ami|bnn|default|enochian|hak|klingon|lux|mingo|navajo|pwn|t(?:a[oy]|su))|sgn-(?:BE-(?:FR|NL)|CH-DE))|(?:art-lojban|cel-gaulish|no-(?:bok|nyn)|zh-(?:guoyu|hakka|min(?:-nan)?|xiang)))|(?:(?<language>(?:[A-Za-z]{2,3}(?:-(?<extlang>[A-Za-z]{3}(?:-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})(?:-(?<script>[A-Za-z]{4}))?(?:-(?<region>[A-Za-z]{2}|[0-9]{3}))?(?:-(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(?:-(?<extension>[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+))*)(?:-(?<privateUse>x(?:-[A-Za-z0-9]{1,8})+))?$/Di

答案 2 :(得分:0)

JavaScript会监管重复的命名捕获组,因此您必须将?<privateUse>的第二次使用更改为例如?<privateUse1>。编译为:

/^((?<grandfathered>(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)|(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang))|((?<language>([A-Za-z]{2,3}(-(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2}))?)|[A-Za-z]{4}|[A-Za-z]{5,8})(-(?<script>[A-Za-z]{4}))?(-(?<region>[A-Za-z]{2}|[0-9]{3}))?(-(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(-(?<extension>[0-9A-WY-Za-wy-z](-[A-Za-z0-9]{2,8})+))*(-(?<privateUse>x(-[A-Za-z0-9]{1,8})+))?)|(?<privateUse1>x(-[A-Za-z0-9]{1,8})+))$/

这是一种构造方法:

let privateUseUsed = 0
const privateUse = () => "(?<privateUse" + (privateUseUsed++) + ">x(-[A-Za-z0-9]{1,8})+)"
const grandfathered = "(?<grandfathered>" +
      /* irregular */ (
        "en-GB-oed" +
          "|" + "i-(?:ami|bnn|default|enochian|hak|klingon|lux|mingo|navajo|pwn|tao|tay|tsu)" +
          "|" + "sgn-(?:BE-FR|BE-NL|CH-DE)"
      ) +
      "|" + /* regular */ (
        "art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang"
      ) +
      ")"
const langtag = "(" +
      "(?<language>" + (
        "([A-Za-z]{2,3}(-" +
          "(?<extlang>[A-Za-z]{3}(-[A-Za-z]{3}){0,2})" +
          ")?)|[A-Za-z]{4,8})"
      ) +
      "(-" + "(?<script>[A-Za-z]{4})" + ")?" +
      "(-" + "(?<region>[A-Za-z]{2}|[0-9]{3})" + ")?" +
      "(-" + "(?<variant>[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})" + ")*" +
      "(-" + "(?<extension>" + (
        /* singleton */ "[0-9A-WY-Za-wy-z]" +
          "(-[A-Za-z0-9]{2,8})+)"
      ) +
      ")*" +
      "(-" + privateUse() + ")?" +
      ")"
const languageTagReStr = "^(" + grandfathered + "|" + langtag + "|" + privateUse() + ")$";

编辑:结果是ff不支持命名的捕获组,因此您必须使用.replace(/\?<a-zA-Z>/g, '')将其删除,否则请先将它们删除:

const grandfathered = "(" +
      /* irregular */ "(en-GB-oed|i-ami|i-bnn|i-default|i-enochian|i-hak|i-klingon|i-lux|i-mingo|i-navajo|i-pwn|i-tao|i-tay|i-tsu|sgn-BE-FR|sgn-BE-NL|sgn-CH-DE)" +
      "|" +
      /* regular */ "(art-lojban|cel-gaulish|no-bok|no-nyn|zh-guoyu|zh-hakka|zh-min|zh-min-nan|zh-xiang)" +
      ")";
const langtag = "(" +
      "(" + (
        "([A-Za-z]{2,3}(-" +
          "([A-Za-z]{3}(-[A-Za-z]{3}){0,2})" +
          ")?)|[A-Za-z]{4}|[A-Za-z]{5,8})"
      ) +
      "(-" + "([A-Za-z]{4})" + ")?" +
      "(-" + "([A-Za-z]{2}|[0-9]{3})" + ")?" +
      "(-" + "([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})" + ")*" +
      "(-" + "(" + (
        /* singleton */ "[0-9A-WY-Za-wy-z]" +
          "(-[A-Za-z0-9]{2,8})+)"
      ) +
      ")*" +
      "(-" + "(x(-[A-Za-z0-9]{1,8})+)" + ")?" +
      ")";
const languageTag = RegExp("^(" + grandfathered + "|" + langtag + "|" + "(x(-[A-Za-z0-9]{1,8})+)" + ")$");

使用languageTag.test('en-us')

进行测试

答案 3 :(得分:0)

如果使用基于CLDR的功能集(例如PHP的intl扩展名),则可以使用以下函数检查intl数据库中是否存在语言环境:

<?php
 function is_locale($locale=''){
  // STANDARDISE INPUT
  $locale=locale_canonicalize($locale);

  // LOAD ARRAY WITH LOCALES
  $locales=resourcebundle_locales(NULL);

  // RETURN WHETHER FOUND
  return (array_search($locale,$locales)!==F);
 }
?>

加载和搜索数据大约需要半毫秒,因此不会对性能造成太大影响。

当然,它只会在所使用的PHP版本随附的CLDR版本的数据库中找到这些内容,但是会在以后的每个PHP版本中进行更新。