Question

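I have a file containing auto-generated statistics from Apache HTTP logs.

I'm really struggling to work out how to match the lines between two sections of text. This is a portion of the stat file I have: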

jpg 6476 224523785 0 0
Unknown 31200 248731421 0 0
gif 197 408771 0 0
END_FILETYPES

# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS

# Browser ID - Hits
BEGIN_BROWSER 79
mnuxandroid 1034
winlong 752
winxp 1320

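What I'm trying to do is write a regex that searches only between the markers BEGIN_OS 12 and END_OS.

I want to create a PHP array containing the OS names and the hits, for example (I know the actual array won't look exactly like this, but as long as I have this data in it):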

array(
   [0] => array(
      [0] => linuxandroid
      [1] => winlong
      [2] => winxp
      [3] => win2008
   )
   [1] => array(
      [0] => 1034
      [1] => 752
      [2] => 1320
      [3] => 204250
   )
)

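I've been testing regexes with the gskinner regex tester for a good few hours now, but regex is far from my strong point.

I would post what I've got so far, but I've tried loads, and the closest I've got is: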

^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)

Which is pitifully awful!

Any help would be appreciated, even if it's an "it can't be done".

Answer 1

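Regular expressions might not be the best tool for this job. You could use a regex to extract the substring you need and then do the rest of the processing with PHP's string manipulation functions.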

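// Keep only the text between "BEGIN_OS <n>" and "END_OS".
$string = preg_replace('/^.*BEGIN_OS \d+\s*(.*?)\s*END_OS.*/s', '$1', $text);

$result = [];
// Split on any newline sequence so that \r\n input also works.
foreach (preg_split('/\R/', $string) as $line) {
    list($key, $value) = explode(' ', trim($line));
    $result[$key] = $value;
}

print_r($result);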

That should give you the following output:

Array
(
    [linuxandroid] => 1034
    [winlong] => 752
    [winxp] => 1320
    [win2008] => 204250
)
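If you want the two parallel arrays from the question rather than a name => hits map, here is a minimal sketch of one way to do it. It is not part of the answer above: the variable names ($text, $names, $hits) are made up for illustration, the sample is trimmed from the question, and it assumes each line in the block is a single name followed by a single number.

```php
<?php
// Sample input trimmed from the question; $text is an assumed variable name.
$text = <<<'EOT'
gif 197 408771 0 0
END_FILETYPES

# OS ID - Hits
BEGIN_OS 12
linuxandroid 1034
winlong 752
winxp 1320
win2008 204250
END_OS

# Browser ID - Hits
BEGIN_BROWSER 79
mnuxandroid 1034
EOT;

$names = [];
$hits = [];

// Step 1: isolate the text between the BEGIN_OS and END_OS marker lines.
if (preg_match('/^BEGIN_OS \d+\R(.*?)^END_OS\b/ms', $text, $block)) {
    // Step 2: collect every "name hits" line inside that block.
    preg_match_all('/^(\S+)[ \t]+(\d+)[ \t]*\r?$/m', $block[1], $pairs);
    $names = $pairs[1];
    $hits = array_map('intval', $pairs[2]);
}

print_r([$names, $hits]);
```

Doing it in two steps keeps the second pattern from ever seeing the BEGIN_BROWSER section, which has lines of exactly the same shape.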

Answer 2

You could try something like this:

/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/gm

You would still have to parse the matches to get the results. You could also simplify it with something like this:

/BEGIN_OS 12([\s\S]*)END_OS/gm

Then just parse the first group (the text between the markers), splitting on '\n' and then on ' ' to get the parts you want.
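As a rough sketch of that simpler approach in PHP (my own illustration, not tested against the full stats file): PHP's preg functions do not accept the g modifier, so it is dropped here, and the quantifier is made lazy (*?) so the match stops at the first END_OS. The sample string and the variable names are assumptions.

```php
<?php
// Shortened sample; the real file has more lines around this block.
$text = "BEGIN_OS 12\nlinuxandroid 1034\nwinlong 752\nEND_OS\n";

$result = [];
// Group 1 captures everything between the two markers.
if (preg_match('/BEGIN_OS 12([\s\S]*?)END_OS/', $text, $m)) {
    // Split the captured block on "\n", then each line on ' '.
    foreach (explode("\n", trim($m[1])) as $line) {
        list($os, $hits) = explode(' ', $line);
        $result[$os] = (int) $hits;
    }
}

print_r($result);
```

If the file uses Windows line endings, splitting with preg_split('/\R/', ...) instead of explode("\n", ...) would be the safer choice.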

Edit

The regex with comments:

/BEGIN_OS 12     // Match "BEGIN_OS 12" exactly
\s               // Match a whitespace character after
(?:              // Begin a non-capturing group
  ([\w\d]+)      // Match any word or digit character, at least 1 or more
  \s             // Match a whitespace character
  ([\d]+\s)      // Match a digit character, at least one or more
)*               // End non-capturing group, repeat group 0 or more times
END_OS           // Match "END_OS" exactly
/gm              // global search (g) and multiline (m)
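One thing to be aware of if you run this pattern through PHP's preg_match: when a capturing group sits inside a repeated group, PCRE keeps only the last repetition, which is why the matches still need further parsing. A small sketch with a shortened, made-up sample (the g and m flags are left out because PHP has no g modifier and the pattern has no anchors):

```php
<?php
$text = "BEGIN_OS 12\nlinuxandroid 1034\nwinlong 752\nEND_OS";

// Same pattern as above, written for preg_match.
$matched = preg_match('/BEGIN_OS 12\s(?:([\w\d]+)\s([\d]+\s))*END_OS/', $text, $m);

// Only the final iteration of the repeated group survives in $m[1] and $m[2].
var_dump($matched, $m[1], $m[2]);
```

Here $m[1] ends up as "winlong" and $m[2] as "752" plus the trailing newline; the linuxandroid pair is matched but not kept.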

The simple version:

/BEGIN_OS 12     // Match "BEGIN_OS 12" exactly
(                // Begin group
  [\s\S]*        // Match any whitespace/non-whitespace character (works like the '.' but captures newlines)
)                // End group
END_OS           // Match "END_OS" exactly
/gm              // global search (g) and multiline (m)

Second edit

Your attempt:

^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)

won't give you the results you expect. If you break it apart:

^                // Match the start of a line; without 'm' this means the beginning of the string.
[BEGIN_OS\s12]+  // This means, match a character that is any of [B, E, G, I, N, _, O, S, \s, 1, 2]
                 // where there is at least 1 or more. While this matches "BEGIN_OS 12"
                 // it also matches any other lines that contain a combination of those
                 // characters, or just a line of whitespace (thanks to \s).
([a-zA-Z0-9]+)   // This should match the part you expect, but potentially not with the previous rules in place.
\s
([0-9]+)         // This is the same as [\d]+ or \d+ but should match what you expect (again, potentially not with the first rule)
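To see the character-class problem in action, here is a small sketch. "BONES 21" is a made-up header that is clearly not BEGIN_OS 12, and $literal is my own rewrite of the pattern with the header as literal text, shown only for comparison.

```php
<?php
// The pattern from the question: [...] is a character class, not a literal string.
$attempt = '/^[BEGIN_OS\s12]+([a-zA-Z0-9]+)\s([0-9]+)/';
// The same idea with the header written as literal text instead.
$literal = '/^BEGIN_OS\s12\s([a-zA-Z0-9]+)\s([0-9]+)/';

// Hypothetical input whose first line is not a BEGIN_OS header.
$bogus = "BONES 21\nwinxp 1320";

// The class accepts B, O, N, E, S, the space, 2, 1 and the newline, so this matches.
$attemptHit = preg_match($attempt, $bogus, $m);
// The literal version requires the exact text "BEGIN_OS 12", so this does not.
$literalHit = preg_match($literal, $bogus);

var_dump($attemptHit, $m[1], $m[2], $literalHit);
```

The attempt reports a match and captures winxp and 1320 even though the header is wrong, while the literal version reports no match. Either way, a single preg_match captures at most one name/hits pair, so the full list still needs one of the approaches above.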

Regex for matching between text
