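RegEx for matching SRT and VTT syntax in subtitles

Time: 2019-05-08 22:37:21

Tags: php regex regex-group srt vtt

I have subtitles in both srt and vtt formats, and I need to match and remove the format-specific syntax so that only the clean text lines remain.

I came up with this regex: /\n?\d*?\n?^.* --> [012345]{2}:.*$/m

Sample content (srt and vtt mixed together):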

1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2

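This matches both the subtitle numbers and the timing lines, as simulated at https://regex101.com/r/zRsRMR/2/

But when used in actual code (even with the snippet generated by https://regex101.com), it only matches the timing lines, not the subtitle numbers.

See the output: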

array (5)
0 => array (1)
0 => "00:00:04,019 --> 00:00:07,299
" (30)
1 => array (1)
0 => "
00:00:07,414 --> 00:00:09,155
" (31)
2 => array (1)
0 => "
00:00:09,276 --> 00:00:11,429
" (31)
3 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)
4 => array (1)
0 => "
00:00:11,549 --> 00:00:14,874
" (31)

It can be tested at http://sandbox.onlinephpfunctions.com/code/dec294251b879144f40a6d1bdd516d2050321242

The goal is to match the subtitle numbers as well; for example, the first expected match should be:

1
00:00:04,019 --> 00:00:07,299

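2 Answers:

Answer 0: (score: 3)

I'm not sure whether this is exactly what you want to capture. However, you may want to wrap parts of the pattern in capture groups so that the pieces are easy to retrieve. For example, this expression shows how capture groups can be placed around the desired characters: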

^([0-9]+\n|)([0-9:,->\s]+)

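This may not be the way to do it, nor the best expression. However, it may give you an idea of how to approach the problem differently.

I'm guessing you want to capture the timing line and the line before it, which may or may not contain a number.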

The diagram below shows how the expression works; other expressions can be visualized in this link:

[image: visualization of the expression]

You may want to write a script that cleans up the data before passing it to the regex engine, so that a simple expression is enough.
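
As one possible cleanup step, here is a minimal sketch that normalizes line endings before matching. It assumes the input may contain carriage returns (CRLF); the line break shown inside each dumped match in the question hints at that, but the post does not confirm it. normalizeNewlines is a hypothetical helper name, and the pattern is the one from the question. JavaScript and PCRE differ in how they treat \r, so the exact matches in PHP may look different.

```javascript
// Hypothetical helper: turn CRLF and lone CR into LF so the pattern
// only ever sees "\n" as a line break.
function normalizeNewlines(text) {
  return text.replace(/\r\n?/g, "\n");
}

// The pattern from the question, with the g flag added to collect all matches.
const original = /\n?\d*?\n?^.* --> [012345]{2}:.*$/gm;

// Assumed input with Windows line endings.
const crlf = "1\r\n00:00:04,019 --> 00:00:07,299\r\nline1\r\n";

const before = crlf.match(original);                   // number line is missed
const after = normalizeNewlines(crlf).match(original); // number line is included

console.log(JSON.stringify(before));
console.log(JSON.stringify(after));
```

In PHP, the same idea could be applied by replacing "\r\n" with "\n" (for example with str_replace) before calling preg_match_all.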

Example test with JavaScript

const regex = /^([0-9]+\n|)([0-9:,->\s]+)/mg;
const str = `1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

PHP test

This may not produce the output you want; it is only an example:

$re = '/^([0-9]+\n|)([0-9:,->\s]+)/m';
$str = '1
00:00:04,019 --> 00:00:07,299
line1
line2

2
00:00:07,414 --> 00:00:09,155
line1

00:00:09,276 --> 00:00:11,429
line1

00:00:11,549 --> 00:00:14,874
line1
line2
';

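// PREG_PATTERN_ORDER puts every full match into $matches[0].
preg_match_all($re, $str, $matches, PREG_PATTERN_ORDER);

foreach ($matches[0] as $key => $value) {
    $value = trim($value);
    if ($value === "") {
        // Drop matches that contained only whitespace (blank lines).
        unset($matches[0][$key]);
    } else {
        $matches[0][$key] = $value;
    }
}

var_dump($matches[0]);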

Performance test

This JavaScript snippet runs the expression in a simple loop of one million iterations to give a rough idea of its performance.

const repeat = 1000000;
const start = Date.now();
let match;

// Run the replacement exactly `repeat` times.
for (let i = repeat; i > 0; i--) {
	const string = '2  \n00:00:07,414 --> 00:00:09,155';
	const regex = /(.*)([0-9:,->\s]+)/gm;
	match = string.replace(regex, "$2");
}

const elapsed = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match  ");
console.log(elapsed / 1000 + " is the runtime of " + repeat + " times benchmark test.  ");

If you want to capture all of the desired output in a single variable, add a capture group around the whole expression and refer to it with $1.
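
To illustrate the single-group idea, here is a small sketch. The pattern is a simplified variant written only for this example (an assumption, not one of the expressions in this answer): one outer group wraps the optional number line and the timing line, so group 1 (or $1 in a replacement string) holds the whole header.

```javascript
// Assumed, simplified pattern: the outer (...) is capture group 1.
const whole = /^((?:[0-9]+\n)?[0-9:,]+ --> [0-9:,]+)$/gm;

const sample = "2\n00:00:07,414 --> 00:00:09,155\nline1\n";

// Collect group 1 of every match.
const headers = [];
for (const m of sample.matchAll(whole)) {
  headers.push(m[1]);
}

// Refer to the same group as $1 inside a replacement string.
const tagged = sample.replace(whole, "<$1>");

console.log(JSON.stringify(headers));
console.log(JSON.stringify(tagged));
```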

If needed, you can also tighten or loosen the boundaries, for example with this one:

^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$

Example test of the second expression with JavaScript

const regex = /^(?:[0-9]+\n|\n)(([0-9:,]+)([\s->]+)([0-9:,]+))$/gm;
const str = `1
00:00:04,019 --> 00:00:07,299
- cdcdc
- cddcd

2
00:00:07,414 --> 00:00:09,155
54564

00:00:09,276 --> 00:00:11,429
- 445454 - ccd
- cdscdcdcd

00:00:11,549 --> 00:00:14,874
line1
line2
`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

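Answer 1: (score: 2)

You can turn this part of the expression, \n?\d*?\n?, into an optional group that matches 1+ digits followed by a newline. The character class [012345] can also be written as [0-5].

You can update the expression to: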

^(?:\d+\n)?.*\h+-->\h+[0-5]{2}:.*$
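  • ^ Start of the string (or of each line when the m flag is used)
  • (?:\d+\n)? Optional group matching 1+ digits and a newline
  • .*\h+-->\h+ Match any char except a newline 0+ times, then 1+ horizontal whitespace chars, -->, and 1+ horizontal whitespace chars
  • [0-5]{2}: Match a digit 0-5 two times, followed by a colon
  • .* Match any char except a newline 0+ times
  • $ End of the string (or of each line when the m flag is used)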

Regex demo | Php demo
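
Tying this back to the original goal (clean text lines), a minimal sketch in JavaScript might look like the following. JavaScript has no \h, so [ \t] is used in its place; that substitution and the sample data are assumptions made for this example.

```javascript
// The updated expression, with [ \t] standing in for PCRE's \h.
const cueHeader = /^(?:\d+\n)?.*[ \t]+-->[ \t]+[0-5]{2}:.*$/gm;

// Assumed sample: one numbered (srt-style) cue and one unnumbered (vtt-style) cue.
const subtitles = [
  "1",
  "00:00:04,019 --> 00:00:07,299",
  "line1",
  "line2",
  "",
  "00:00:07,414 --> 00:00:09,155",
  "line3",
  "",
].join("\n");

// Remove the headers, then drop the blank lines that are left behind.
const textLines = subtitles
  .replace(cueHeader, "")
  .split("\n")
  .filter((line) => line.trim() !== "");

console.log(JSON.stringify(textLines));
```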