I'm having some trouble parsing the plain-text output of samtools stats.
Sample output:
45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
I want to parse the file line by line and end up with the following PHP array:
Array(
"in total" => [45205768,0],
...
)
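So, long story short: I want the numbers at the start of each line as an array of integers, and the string that follows them (without the parenthesized part) as the key.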
Answer 0 (score: 1)
Answer 1 (score: 0)
Answer 2 (score: 0)
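This can be solved with two capture groups plus the full-string match.
My pattern extracts exactly the desired substrings and trims the trailing space from the soon-to-be keys: Pattern Demo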
^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s) #244steps
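To make the role of each piece of the pattern easier to see, here is a minimal sketch that applies it to a single line from the sample input. The variable names `$line` and `$m` are only illustrative. The two capture groups hold the counts, `\K` drops everything matched so far from the full match, and the `(?=\s)` lookahead forces the label to stop just before whitespace.

```php
<?php
// One line taken from the sample input.
$line = "44647359 + 0 mapped (98.76% : N/A)";

// (\d+) \+ (\d+)  -> groups 1 and 2: the two counts
// \K              -> forget what was matched so far, so group 0 starts here
// [a-z\d ]+       -> the label: lowercase letters, digits and spaces
// (?=\s)          -> the label must end right before whitespace (no trailing space)
preg_match('/^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)/', $line, $m);

var_export($m); // $m[0] is the label, $m[1] and $m[2] are the counts as strings
```

Because the lookahead requires whitespace after the label, a final line with no parenthesized part and no trailing newline would not match; the sample input happens to end with a parenthesized part, so it is unaffected.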
PHP code: (Demo)
$txt='45205768 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5203838 + 0 duplicates
44647359 + 0 mapped (98.76% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)';
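// The m flag makes ^ match at the start of every line; \K leaves only the key text in $out[0]
preg_match_all('/^(\d+) \+ (\d+) \K[a-z\d ]+(?=\s)/m', $txt, $out);
$result = []; // initialize before filling to avoid an undefined-variable notice
foreach ($out[0] as $k => $v) {
    $result[$v] = [(int)$out[1][$k], (int)$out[2][$k]]; // cast the captured digit strings to integers
}
var_export($result);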
Output:
array (
'in total' => array (0 => 45205768, 1 => 0),
'secondary' => array (0 => 0, 1 => 0),
'supplementary' => array (0 => 0, 1 => 0),
'duplicates' => array (0 => 5203838, 1 => 0),
'mapped' => array (0 => 44647359, 1 => 0),
'paired in sequencing' => array (0 => 0, 1 => 0),
'read1' => array (0 => 0, 1 => 0),
'read2' => array (0 => 0, 1 => 0),
'properly paired' => array (0 => 0, 1 => 0),
'with itself and mate mapped' => array (0 => 0, 1 => 0),
'singletons' => array (0 => 0, 1 => 0),
'with mate mapped to a different chr' => array (0 => 0, 1 => 0)
)
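Note that the last two lines of the input text produce the same key in the $result array, which means the data from the earlier line is overwritten by the data from the later line. If that is a problem, you could restructure your input data, or simply include the parenthesized part in the key to keep it unique.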
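One possible way to keep both of those rows is to capture the parenthesized part as an optional group and append it to the key only when the bare label is already taken. This is a sketch rather than part of the original answer: the function name `parseCounts` and the slightly different pattern (a lazy label followed by an optional parenthesized group, anchored with `$`) are my own assumptions, and it expects `\n` line endings.

```php
<?php
// Hypothetical helper (not from the original answer): build the label => [counts]
// array, falling back to "label (note)" when the bare label already exists.
function parseCounts(string $txt): array
{
    $result = [];
    // Groups 1 and 2: the counts. Group 3: the label (lazy).
    // Group 4: the optional trailing parenthesized part.
    $pattern = '/^(\d+) \+ (\d+) ([a-z\d ]+?)(?: (\(.*\)))?$/m';
    preg_match_all($pattern, $txt, $rows, PREG_SET_ORDER);

    foreach ($rows as $row) {
        $key = $row[3];
        // Only disambiguate when the label collides and a parenthesized part exists.
        if (isset($result[$key]) && !empty($row[4])) {
            $key .= ' ' . $row[4];
        }
        $result[$key] = [(int)$row[1], (int)$row[2]];
    }
    return $result;
}

// Usage with two colliding lines (the counts 3 and 1 are made up for illustration).
$sample = "0 + 0 with mate mapped to a different chr\n"
        . "3 + 1 with mate mapped to a different chr (mapQ>=5)";
$parsed = parseCounts($sample);
var_export($parsed);
```

With this approach the first occurrence keeps the short key and only the later, colliding row gets the longer `... (mapQ>=5)` key, so code that already looks up the short labels keeps working.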