Decoding VTT timestamp data

Time: 2019-03-29 22:35:10

Tags: php preg-match

I have hundreds of lines from a VTT subtitle file. A sample from the subtitle file:

00:01:03.500 --> 00:01:03.510 align:start position:0%
<c.colorCCCCCC>fourth guess it came from a broken</c><c.colorE5E5E5> home
 </c>

00:01:03.510 --> 00:01:08.140 align:start position:0%
<c.colorCCCCCC>fourth guess it came from a broken</c><c.colorE5E5E5> home
a<00:01:04.580><c> father</c><00:01:05.580><c> not</c><00:01:05.820><c> being</c><00:01:05.880><c> there</c><00:01:06.890><c> my</c><00:01:07.890><c> mother</c></c>

00:01:08.140 --> 00:01:08.150 align:start position:0%
a<c.colorE5E5E5> father not being there my mother
 </c>

00:01:08.150 --> 00:01:13.429 align:start position:0%
a<c.colorE5E5E5> father not being there my mother</c>
<c.colorE5E5E5>getting<00:01:09.150><c> married</c><00:01:09.630><c> and</c><00:01:11.360><c> the</c><00:01:12.360><c> abuse</c></c><c.colorCCCCCC><00:01:12.659><c> started</c><00:01:13.049><c> at</c></c>

00:01:13.429 --> 00:01:13.439 align:start position:0%
<c.colorE5E5E5>getting married and the abuse</c><c.colorCCCCCC> started at
 </c>

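VTT subtitle files are very confusing, but the goal is to grab all the words inside the timestamp tags, plus the timestamps themselves. I was thinking of using preg_match, but I don't know how to go about it.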

$pattern = "<([^;]*)>";
preg_match_all($pattern, $lineContent, $allintag);

is as far as I got before getting stuck. The output I am after is:

array(
00:01:03.510,
00:01:04.580,
00:01:05.580,
00:01:05.820,
00:01:05.880,
00:01:06.890,
00:01:07.890,
00:01:08.140,
00:01:09.150,
00:01:09.630,
00:01:11.360,
00:01:12.360,
00:01:12.659,
00:01:13.049
)
array(
'fourth guess it came from a broken home',
'father',
'not',
'being',
'there',
'my',
'mother',
'getting',
'married',
'and',
'the',
'abuse',
'started',
'at'
)

1 answer:

Answer 0 (score: 0)

You can use

'~<(?<time>\d{2}:\d{2}:\d{2}\.\d+)><c>\s*(?<text>.*?)</c>~'

If the time and text groups do not always appear consecutively, use

'~<(?<time>\d{2}:\d{2}:\d{2}\.\d+)>|<c>\s*(?<text>.*?)</c>~'

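See the regex demo.

Details

  • < - a < character
  • (?<time>\d{2}:\d{2}:\d{2}\.\d+) - group "time": 2 digits, :, 2 digits, :, 2 digits, ., then 1 or more digits
  • > - a > character
  • <c> - a literal <c> string
  • \s* - 0 or more whitespace characters
  • (?<text>.*?) - group "text": any 0 or more characters other than line breaks, as few as possible
  • </c> - a literal </c> string.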
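With the alternation pattern, each match fills only one of the two named groups, so the results need filtering before use. The following is a minimal sketch of one way to do that, not the only one. It assumes PHP 7.2 or later (for PREG_UNMATCHED_AS_NULL), and the variable names $line, $times and $words are illustrative.

```php
<?php
// Alternation variant: each match is EITHER a timestamp tag OR a <c>...</c> word.
$pattern = '~<(?<time>\d{2}:\d{2}:\d{2}\.\d+)>|<c>\s*(?<text>.*?)</c>~';

// Short fragment copied from the question's sample cue text.
$line = "<00:01:04.580><c> father</c><00:01:05.580><c> not</c>";

// PREG_SET_ORDER gives one array per match; PREG_UNMATCHED_AS_NULL reports
// a group that did not take part in the match as null instead of "".
preg_match_all($pattern, $line, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

$times = [];
$words = [];
foreach ($matches as $m) {
    // isset() is false for both a null value and a missing key.
    if (isset($m['time'])) {
        $times[] = $m['time'];
    }
    if (isset($m['text'])) {
        $words[] = $m['text'];
    }
}

print_r($times);
print_r($words);
```

For this fragment, $times holds the two timestamps and $words holds "father" and "not". Note that neither pattern matches class-carrying tags such as <c.colorE5E5E5>, because both look for a bare <c>.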

See the PHP demo:

$lineContent = "<00:01:13.650><c> time</c><00:01:13.920><c> and</c> 
 <00:01:14.780><c> that's</c><00:01:15.780><c> what</c>";
if (preg_match_all('~<(?<time>\d{2}:\d{2}:\d{2}\.\d+)><c>\s*(?<text>.*?)</c>~', $lineContent, $allintag)) {
    print_r($allintag["time"]);
    print_r($allintag["text"]);
}

Output:

Array ( [0] => 00:01:13.650 [1] => 00:01:13.920 [2] => 00:01:14.780 [3] => 00:01:15.780 )
Array ( [0] => time [1] => and  [2] => that's  [3] => what )
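The output the question asks for also includes the cue start times and the untimed leading text, which the pattern above does not capture. The sketch below shows one way to handle a whole cue block. It assumes the cue has already been isolated as a string with "\n" line breaks. parseCue is a hypothetical helper name, and the sample is a shortened copy of the second cue from the question.

```php
<?php
// Hypothetical helper: splits one cue block into its header times,
// its inline (timestamp, word) pairs, and its tag-free text.
function parseCue(string $cue): array
{
    $ts = '\d{2}:\d{2}:\d{2}\.\d+';

    // Header line, e.g. "00:01:03.510 --> 00:01:08.140 align:start position:0%"
    if (!preg_match('~^(' . $ts . ') --> (' . $ts . ')~', $cue, $header)) {
        return [];
    }

    // Everything after the header line is the cue payload.
    $parts = explode("\n", $cue, 2);
    $payload = $parts[1] ?? '';

    // Inline timestamp tags followed by <c>word</c>, as in the answer's pattern.
    preg_match_all('~<(?<time>' . $ts . ')><c>\s*(?<text>.*?)</c>~', $payload, $m);

    // Remove every <...> tag, then collapse runs of whitespace.
    $plain = trim(preg_replace('~\s+~', ' ', preg_replace('~<[^>]*>~', '', $payload)));

    return [
        'start' => $header[1],
        'end'   => $header[2],
        'times' => $m['time'],
        'words' => $m['text'],
        'plain' => $plain,
    ];
}

// Shortened copy of the second cue in the question.
$cue = "00:01:03.510 --> 00:01:08.140 align:start position:0%\n"
     . "<c.colorCCCCCC>fourth guess it came from a broken</c><c.colorE5E5E5> home\n"
     . "a<00:01:04.580><c> father</c><00:01:05.580><c> not</c><00:01:05.820><c> being</c></c>";

$result = parseCue($cue);
print_r($result);
```

Stripping tags with a generic <[^>]*> pattern is a simplification that is adequate for the markup shown in the question. A file using other WebVTT features may need a proper parser.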