Mind-boggling regex to convert a whitespace/comma-delimited input string into an array. Must support quoting

Time: 2013-09-08 00:37:54

Tags: php regex string preg-match-all delimited

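Here is my best attempt (so far) at solving this problem. I'm new to regular expressions and this is a substantial problem, but I'm giving it a try. Regex clearly takes some time to master.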

This seems to satisfy the delimiter/comma requirement. It looks redundant to me because of the repeated \s*, so there is probably a better way.

/\s*[,|\s*]\s*/
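
One detail about the pattern above: inside a character class, | and * are literal characters rather than alternation and repetition, so [,|\s*] matches a single comma, pipe, whitespace character, or asterisk. Below is a minimal sketch of one alternative, assuming preg_split() is acceptable and setting the quoting requirement aside for the moment. The sample input and variable names are made up for illustration.

```php
<?php
// Delimiter: a comma with optional whitespace around it, OR a run of whitespace.
// This is an illustrative alternative, not the pattern from the question.
$delimiter = '/\s*,\s*|\s+/';

// Hypothetical sample input with mixed delimiters.
$input = 'one , two  three,four';

// PREG_SPLIT_NO_EMPTY drops empty pieces (for example from ",," or leading whitespace).
$words = preg_split($delimiter, $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($words); // one, two, three, four
```

Because this splits on every whitespace run, it would also break quoted phrases apart, which is why a matching approach (rather than splitting) is explored next.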

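I found this one on Stack Overflow and have been trying to pick it apart and apply it to my problem (not easy). It seems to satisfy most of the "quoting" requirements, but I'm still working out how to handle the delimiter part of the requirements below.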

/"(?:\\\\.|[^\\\\"])*"|\S+/
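
To see what this pattern does and does not handle, here is a small sketch that feeds it to preg_match_all(). It assumes the pattern is written as a single-quoted PHP string; the sample input is hypothetical.

```php
<?php
// The pattern from the post, as a single-quoted PHP string.
// PHP reduces each \\\\ to \\, which the regex engine reads as one literal backslash.
$pattern = '/"(?:\\\\.|[^\\\\"])*"|\S+/';

// Hypothetical sample input: one quoted phrase and a comma-separated pair.
$input = 'one "two three" four, five';

preg_match_all($pattern, $input, $matches);

// $matches[0] holds the full matches: one, "two three", four, (comma still attached), five
print_r($matches[0]);
```

The quoted phrase survives as one element with its quotes, but \S+ treats the comma as part of the word, which is the delimiter problem mentioned above.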

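The requirements I'm trying to satisfy:

  • A PHP preg_match_all() (or similar) function will use it to break a string into an array of strings. The source language is PHP.
  • Words in the input string are delimited by (0 or more whitespace)(optional comma)(0 or more whitespace), or by (1 or more whitespace) alone.
  • The input string may also contain quoted substrings, each of which becomes a single element in the output array.
  • Quoted substrings must keep their double quotes when placed in the output array (because we must be able to identify them later as having been quoted in the input string).
  • Leading and trailing whitespace inside a quoted substring (that is, whitespace between the double-quote characters and the text itself) must be removed when it is placed in the output array. Example: "<space>hello<space>world<space><tab>" becomes "hello<space>world"
  • Runs of whitespace inside a quoted phrase must be reduced to a single space when placed in its output array element. Example: "hello<space><tab><space><space>world" becomes "hello<space>world"
  • Quoted substrings that are zero-length or contain only whitespace are not placed in the output array (the output array must not contain any zero-length elements).
  • Every element of the output array must be trimmed of whitespace (left and right).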

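This example demonstrates all of the requirements above:

Input string:

"" one "two three" four, five "six seven" ""

Returns this array (the double quotes are actually present in the strings shown below):

{one,"two three",four,five,"six seven"}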

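Edit, September 13, 2013

I have been studying regular expressions hard and have finally settled on this proposed solution. It may not be the best, but it's what I have for now.

I will use this regex with PHP's preg_match_all() function to split the search string into an array: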

/(?:"([^"]*)"|([^\s",]+))/

The leading and trailing "/" are the pattern delimiters that PHP's preg_match_all() requires.

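The matches are then retrieved from the function call like this:

preg_match_all(REGEX, $SearchString, $Matches);
$Array = $Matches[0];

We have to do this because the function fills $Matches (its third argument, passed by reference) with a compound array; its return value is only the number of matches. Element 0 contains the text matched by the full pattern. The other elements contain the values captured by the regex's groups, which we don't need.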
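
As an illustration of that array layout, here is a short sketch using the regex from this edit with the default PREG_PATTERN_ORDER flag. The sample input is hypothetical.

```php
<?php
$pattern = '/(?:"([^"]*)"|([^\s",]+))/';

// Hypothetical sample input.
$input = 'one "two three" four, five';

// The return value is the number of full matches; the matches land in $matches.
$count = preg_match_all($pattern, $input, $matches);

// $matches[0]: full matches, quotes included: one, "two three", four, five
// $matches[1]: capture group 1 (contents of quoted phrases); $matches[1][1] is: two three
// $matches[2]: capture group 2 (unquoted words); $matches[2][0] is: one
print_r($matches);
```

Since only the full matches are needed here, the two capture groups could also be dropped from the pattern; they are kept to match the regex as posted.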

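Next, I will iterate over the resulting array and process each element to meet the requirements (above), which is easier than trying to satisfy all of them in a single step with a single regex.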

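1 Answer:

Answer 0: (score: 0)

I finally developed a solution to this problem. It consists of a few PHP statements that use regular expressions. The final function is below.

The function is part of a class, which is why it starts with "public".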

public function SearchString_ToArr($SearchString) {
    /*
    Purpose
        Used to parse the specified search string into an array of search terms.
        Search terms are delimited by <0 or more whitespace><optional comma><0 or more whitespace>
    Parameters
        SearchString (string) = The search string we're working with.
    Return (array)
        Returns an array using the following rules to parse the specified search string:
            - Each search term from the search string is converted to a single element in the returned array.
            - Search terms are delimited by whitespace and/or commas, or they may be double quoted.
            - Double-quoted search terms may contain multiple words.
        Unquoted Search Terms:
            - These are delimited by any number of whitespace characters or commas in the search string.
            - These have all leading and trailing whitespace trimmed.
        Quoted Search Terms:
            - These are surrounded by double-quotes in the search string.
            - These retain leading and trailing double-quotes in the returned array.
            - These have all leading and trailing whitespace trimmed.
            - These may contain whitespace.
            - These have all containing whitespace converted into a single space.
            - If these are zero-length or contain only whitespace, they are not included in the returned array.
        Example 1:
            SearchString =  ' "" one " two   three " four "five six" " " '
            Returns {"one", ""two three"", "four", ""five six""}
            Notes   The leading whitespace before the first "" is not returned.
                    The first quoted phrase ("") is empty so it is not returned.
                    The term "one" is returned with leading and trailing whitespace removed.
                    The phrase "two three" is returned with leading and trailing whitespace removed.
                    The phrase "two three" has containing whitespace converted to a single space.
                    The phrase "two three" has leading and trailing double-quotes retained.
                    ...
    Version History
        1.0 2013.09.18 Tested by Russ Tanner on PHP 5.3.10.
    */

    $r = array();
    $Matches = array();

    // Split the search string into an array based on whitespace, commas, and double-quoted phrases.
    preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $SearchString, $Matches);
    // At this point ($Matches[0] holds the full matches):
    //  1. all quoted strings have their own element and begin/end with the quote character.
    //  2. all non-quoted strings have their own element and contain no whitespace or commas.
    //  3. empty or whitespace-only quoted strings ("" or " ") are still present.

    // Normalize quoted elements...
    // Collapse every run of whitespace (including a lone tab or newline) into a single space.
    $r = preg_replace('/\s+/', ' ', $Matches[0]);
    // Remove all whitespace between the double-quotes and the string.
    $r = preg_replace('/^"\s+/', '"', $r);
    $r = preg_replace('/\s+"$/', '"', $r);
    // Drop quoted elements that are now empty (""), then reindex the array.
    $r = array_values(array_diff($r, array('""')));

    return $r;
}
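
To try Example 1 from the docblock outside the class, a standalone sketch along the same lines might look like this. The function name search_string_to_array is hypothetical, the regex is written without capture groups for brevity, and a filter step is included so that empty quoted terms are dropped, as the docblock requires.

```php
<?php
// Hypothetical free-function sketch of the approach above, for experimenting outside the class.
function search_string_to_array($searchString) {
    // Match either a double-quoted phrase or a run of non-delimiter characters.
    preg_match_all('/"[^"]*"|[^\s",]+/', $searchString, $matches);

    // Collapse every run of whitespace into a single space.
    $terms = preg_replace('/\s+/', ' ', $matches[0]);
    // Remove the (now single) space just inside the opening and closing quotes.
    $terms = preg_replace('/^" /', '"', $terms);
    $terms = preg_replace('/ "$/', '"', $terms);

    // Drop quoted terms that ended up empty, then reindex from 0.
    return array_values(array_diff($terms, array('""')));
}

// Example 1 from the docblock.
$result = search_string_to_array(' "" one " two   three " four "five six" " " ');
print_r($result); // one, "two three", four, "five six"
```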