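Split a string into a key and values using a regular expression

Time: 2017-07-12 09:37:54

Tags: php regex preg-match-all

I'm having some trouble parsing the plain-text output of samtools stats.

Sample output: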

45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I want to parse the file line by line and end up with the following in a PHP array:

Array(
 "in total" => [45205768,0],
 ...
)

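So, long story short, I want the numeric values at the front of each line as an array of integers, and the string that follows them (without the parenthesized part) as the key.

3 answers:

Answer 0: (score: 1)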

^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$

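This regex puts the first value, the second value, and the following string (without the parenthesized part) into match groups 1, 2, and 3 respectively.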

Regex101 demo
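
A minimal sketch of how this pattern might be applied in PHP, assuming the report is already held in a string. The variable names $report and $stats and the shortened sample are illustrative and not part of the answer. Group 3 also captures the space before an opening parenthesis, so the sketch trims it before using it as a key.

```php
<?php
// Hypothetical sample: three lines taken from the report in the question.
$report = "45205768 + 0 in total (QC-passed reads + QC-failed reads)\n"
        . "0 + 0 secondary\n"
        . "44647359 + 0 mapped (98.76% : N/A)";

$stats = [];
// The m flag makes ^ and $ anchor at every line;
// PREG_SET_ORDER returns one match array per line.
preg_match_all('/^(\d+)\s\+\s(\d+)\s([a-zA-Z0-9 ]+).*$/m', $report, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // Group 3 keeps the space before "(", so trim it before using it as the key.
    $stats[trim($m[3])] = [(int) $m[1], (int) $m[2]];
}
var_export($stats);
```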

Answer 1: (score: 0)

I think this is what you're after:

^(\d+)(\s\+\s)(\d+)(.+)

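See it work here on Regex101. Take the first and third groups for the two numbers; the fourth group holds the rest of the line, including any parenthesized part.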
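
A sketch of how these groups might be turned into the requested array, assuming the same line format as in the question. Cutting group 4 at the first "(" with explode() is an addition made here to recover the key, not something the answer specifies, and $report, $stats, and $label are illustrative names.

```php
<?php
// Hypothetical sample: two lines taken from the report in the question.
$report = "5203838 + 0 duplicates\n"
        . "0 + 0 properly paired (N/A : N/A)";

$stats = [];
// Groups 1 and 3 are the two counts; group 4 is everything after the second count.
preg_match_all('/^(\d+)(\s\+\s)(\d+)(.+)/m', $report, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // Keep only the text before the first "(" and strip the surrounding spaces.
    $label = trim(explode('(', $m[4], 2)[0]);
    $stats[$label] = [(int) $m[1], (int) $m[3]];
}
var_export($stats);
```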

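Answer 2: (score: 0)

This can be solved with two capture groups and the full-string match.

My pattern extracts exactly the desired substrings and trims the trailing space from the to-be-declared "key": Pattern Demo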

^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)  #244steps

PHP code: (Demo)

$txt='45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)';

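$result=[];  // initialize so the array exists even when nothing matches
// \K discards the two numbers from the full match, so $out[0] holds only the key text
preg_match_all('/^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)/m',$txt,$out);
foreach($out[0] as $k=>$v){
    $result[$v]=[(int)$out[1][$k],(int)$out[2][$k]];  // re-casting strings as integers
}
var_export($result);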

Output:

array (
  'in total' => array (0 => 45205768, 1 => 0),
  'secondary' => array (0 => 0, 1 => 0),
  'supplementary' => array (0 => 0, 1 => 0),
  'duplicates' => array (0 => 5203838, 1 => 0),
  'mapped' => array (0 => 44647359, 1 => 0),
  'paired in sequencing' => array (0 => 0, 1 => 0),
  'read1' => array (0 => 0, 1 => 0),
  'read2' => array (0 => 0, 1 => 0),
  'properly paired' => array (0 => 0, 1 => 0),
  'with itself and mate mapped' => array (0 => 0, 1 => 0),
  'singletons' => array (0 => 0, 1 => 0),
  'with mate mapped to a different chr' => array ( 0 => 0, 1 => 0)
)

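Note that the last two lines of the input text produce a duplicate key in the $result array, which means the data from the earlier line is overwritten by the data from the later one. If this is a problem, you could restructure your input data, or simply keep the parenthesized portion as part of the key to make it unique.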
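
As a rough illustration of the second option, the sketch below keeps everything after the two counts as the key, parenthesized part included, so the two "different chr" lines no longer collide. The simplified pattern and the names $report and $stats are assumptions made for this example, not part of the answer's code.

```php
<?php
// Hypothetical sample: the two colliding lines from the report in the question.
$report = "0 + 0 with mate mapped to a different chr\n"
        . "0 + 0 with mate mapped to a different chr (mapQ>=5)";

$stats = [];
// Group 3 takes the whole remainder of the line, including any "(...)" suffix.
preg_match_all('/^(\d+) \+ (\d+) (.+)$/m', $report, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    $stats[$m[3]] = [(int) $m[1], (int) $m[2]];
}
var_export($stats);
```

The keys are longer this way, but each input line keeps its own entry.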