PHP regex to match lines in all uppercase letters with an occasional hyphen

Asked: 2010-04-20 13:05:28

Tags: php regex parsing

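I am trying to convert an existing PHP regular expression so that it works with a slightly different document style.

This is the original document style: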

**FOODS - TYPE A** 
___________________________________ 
**PRODUCT** 
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese; 
2) La Fe String Cheese 
**CODE** 
Sell by date going back to February 1, 2009 

The PHP regex code that currently works matches only if the line is surrounded by asterisks, and stores each side of the " - " as $m[1] and $m[2] respectively.

if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
    // $m[2] is set only for **header - subheader**.
    if ( isset($m[2]) ) {
        return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
    }
    else {
        return array(TYPE_KEY, array($m[1]));
    }
}

So for line 1: $m[1] = "FOODS" and $m[2] = "TYPE A"; line 2 is skipped; line 3: $m[1] = "PRODUCT"; and so on.

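Question: how would I rewrite the regex match above if the headers have no asterisks, but are still all uppercase and at least 4 characters long? For example: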

FOODS - TYPE A 
___________________________________ 
PRODUCT
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese; 
2) La Fe String Cheese 
CODE
Sell by date going back to February 1, 2009 

Thanks.

4 answers:

Answer 0 (score: 2)

Something along these lines (don't forget the "u" flag for Unicode regular expressions):

^(?:\*\*)?(?=[^*]{4,})(\p{Lu}+)(?:\s*-\s*(\p{Lu}+))?(?:\*\*)?\s*$
^               # start of line
(?:\*\*)?       # two stars, optional
(?=[^*]{4,})    # followed by at least 4 non-star characters
(\p{Lu}+)       # group 1, Unicode upper case letters
(?:             # start no capture group
  \s*-\s*       #   space*, dash, space*
  (\p{Lu}+)     #   group 2, Unicode upper case letters
)?              # end no capture group, make optional
(?:\*\*)?       # two stars, optional
\s*             # optional trailing spaces
$               # end of line
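
A minimal sketch of how this pattern might be applied with `preg_match`, assuming the `x` flag (so the commented layout can be kept) together with the `u` flag for UTF-8 input. The function name `matchUnicodeHeader` is hypothetical. Note that `\p{Lu}+` matches letters only, so a multi-word subheader such as `TYPE A` is not matched by this version.

```php
<?php
// Hypothetical helper: returns [header, subheader-or-null],
// or null when the line is not recognised as a header.
function matchUnicodeHeader(string $line): ?array
{
    // "~" is the delimiter because "#" starts a comment under the x flag.
    $pattern = '~
        ^
        (?:\*\*)?                 # two stars, optional
        (?=[^*]{4,})              # at least 4 non-star characters follow
        (\p{Lu}+)                 # group 1, Unicode upper case letters
        (?:\s*-\s*(\p{Lu}+))?     # optional dash and group 2
        (?:\*\*)?                 # two stars, optional
        \s*                       # optional trailing spaces
        $
    ~xu';

    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // Group 2 is absent from $m when the optional part did not match.
    return [$m[1], $m[2] ?? null];
}

var_dump(matchUnicodeHeader('**FOODS - TYPE**'));   // starred header with subheader
var_dump(matchUnicodeHeader("\u{00C9}COLE"));       // non-ASCII upper case letter
var_dump(matchUnicodeHeader('FOODS - TYPE A'));     // subheader contains a space
```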

Edit: simplified, per the comments:

^(?=[A-Z -]{4,})([A-Z ]+)(?:-([A-Z ]+))?\s*$
^               # start of line
(?=[A-Z -]{4,}) # followed by at least 4 upper case characters, spaces or dashes
([A-Z ]+)       # group 1, upper case letters or space
(?:             # start no capture group
  -             #   a dash
  ([A-Z ]+)     #   group 2, upper case letters or space
)?              # end no capture group, make optional
\s*             # optional trailing spaces
$               # end of line

The contents of groups 1 and 2 must be trimmed before use.
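
As a rough sketch, the simplified pattern could be dropped into the question's code like this. The wrapper function `classifyLine` is hypothetical, and the values given to `TYPE_HEADER` and `TYPE_KEY` are assumptions, since the question does not show how they are defined. The lookahead used here is the `[A-Z -]` form from the commented breakdown.

```php
<?php
// Assumed values; the question only shows the constant names.
const TYPE_HEADER = 'header';
const TYPE_KEY    = 'key';

// Hypothetical wrapper around the simplified pattern.
function classifyLine(string $line): ?array
{
    $pattern = '/^(?=[A-Z -]{4,})([A-Z ]+)(?:-([A-Z ]+))?\s*$/';

    if (!preg_match($pattern, $line, $m)) {
        return null; // not a header line
    }
    // Both groups can carry the spaces around the dash, so trim them.
    if (isset($m[2])) {
        return [TYPE_HEADER, [trim($m[1]), trim($m[2])]];
    }
    return [TYPE_KEY, [trim($m[1])]];
}

$lines = [
    'FOODS - TYPE A ',
    '___________________________________ ',
    'PRODUCT',
    '1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese; ',
    'CODE',
    'Sell by date going back to February 1, 2009 ',
];
foreach ($lines as $line) {
    var_dump(classifyLine($line));
}
```

Because spaces are allowed in both character classes, a line of four or more spaces would also match, so blank lines are probably best filtered out beforehand.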

Answer 1 (score: 1)

^([A-Z]{4,}(?:[A-Z ]*[A-Z])?)(?:\s*-\s*([A-Z]{4,}(?:[A-Z ]*)?))?$

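How about this? It matches a header that starts with an uppercase word of at least 4 letters, plus an optional subheader that starts with at least 4 uppercase letters.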
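
A short sketch of how this pattern behaves, wrapped in a hypothetical function `matchStrictHeader`. Since the spaces around the dash are consumed outside the capture groups, the header needs no trimming when the line has no trailing whitespace. A subheader whose first word is shorter than 4 letters makes the whole line fail.

```php
<?php
// Hypothetical helper: returns the captured groups, or null on no match.
function matchStrictHeader(string $line): ?array
{
    $pattern = '/^([A-Z]{4,}(?:[A-Z ]*[A-Z])?)(?:\s*-\s*([A-Z]{4,}(?:[A-Z ]*)?))?$/';

    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // Drop the full match at index 0 and keep only the groups.
    return array_slice($m, 1);
}

var_dump(matchStrictHeader('FOODS - TYPE A')); // header and subheader
var_dump(matchStrictHeader('CODE'));           // header only
var_dump(matchStrictHeader('FOODS - A'));      // subheader too short
```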

Answer 2 (score: 1)

The regex:

^(?=.{4})([^-]+)(?:-(.*))?$

Explanation:

^          # start of line
(?=.{4})   # look ahead to make sure there are at least 4 characters
([^-]+)    # get all characters until it finds a dash character, if there is any
(?:-(.*))? # optional: skip the dash and continue get all characters until EOL
$          # end of line

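I assume you are only interested in lines that are at least 4 characters long.

Also, I cheated a bit: the regex matches any characters, not just English uppercase letters, because that makes for a simpler expression. Anyway, if you want to make sure it only accepts uppercase letters, this should do it: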

^(?=.{4})([A-Z\s]+)(?:-([A-Z\s]+))?$
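
To illustrate the difference between the two variants, here is a small comparison sketch. The constant names `PERMISSIVE` and `STRICT` and the function `isHeader` are made up for this example. The permissive pattern accepts any line of 4 or more characters, including ordinary body text, while the uppercase-only variant rejects it.

```php
<?php
// Illustrative names for the two patterns from this answer.
const PERMISSIVE = '/^(?=.{4})([^-]+)(?:-(.*))?$/';
const STRICT     = '/^(?=.{4})([A-Z\s]+)(?:-([A-Z\s]+))?$/';

// Hypothetical helper: true when the pattern matches the line.
function isHeader(string $pattern, string $line): bool
{
    return preg_match($pattern, $line) === 1;
}

$body = 'Sell by date going back to February 1, 2009';

var_dump(isHeader(PERMISSIVE, $body));         // body text under the permissive pattern
var_dump(isHeader(STRICT, $body));             // body text under the strict pattern
var_dump(isHeader(STRICT, 'FOODS - TYPE A'));  // real header under the strict pattern
```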

Answer 3 (score: 0)

So all you need to know is that the header starts with four uppercase ASCII letters? This should work:

'#^([A-Z]{4}[^-]*)(?:-(.*))?$#'
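
A quick sketch of this pattern in use, with a hypothetical function `matchByPrefix` that also trims the captured groups. Only the first four characters are checked for case, so a line such as `SELL by date` would be accepted as a header too.

```php
<?php
// Hypothetical helper: returns the trimmed groups, or null on no match.
function matchByPrefix(string $line): ?array
{
    if (!preg_match('#^([A-Z]{4}[^-]*)(?:-(.*))?$#', $line, $m)) {
        return null;
    }
    // Skip the full match at index 0 and trim each captured group.
    return array_map('trim', array_slice($m, 1));
}

var_dump(matchByPrefix('FOODS - TYPE A')); // header and subheader
var_dump(matchByPrefix('Sell by date'));   // mixed case from the second letter on
var_dump(matchByPrefix('SELL by date'));   // only the first four letters are upper case
```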