Comparing files to extract text with a multiline regex: Compare-Object, hashtables, or PSObject?

Date: 2014-10-29 13:19:41

Tags: regex powershell

The beginning of my code is as follows:

$file1 = "G:\test_powershell_subtitle\The Big Bang Theory - 08x06 - french.srt"
$file2 = "G:\test_powershell_subtitle\The Big Bang Theory - 08x06 - english.srt"
$text1 = Get-Content -Path $file1 -Raw
$text2 = Get-Content -Path $file2 -Raw

# One subtitle block: sequence number, timecode line, then text up to the next blank line
$regex = [regex]'(?m)(?<sequence>\d+)\r\n(?<timecode>\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3})\r\n(?<text>[\s\S]*?\r\n\r\n)'

# Read the named groups from each Match object directly,
# instead of re-matching and overwriting the automatic $Matches variable
$regex.Matches($text1) | ForEach-Object {
    New-Object psobject -Property @{
        sequence = $_.Groups['sequence'].Value
        timecode = $_.Groups['timecode'].Value
        text     = $_.Groups['text'].Value
    }
}

Output:

timecode                                    sequence                text                            
----                                         --------                ----                                       
00:00:02,880 --> 00:00:04,146                1                       I like your suit....       
00:00:04,148 --> 00:00:06,699                2                       Oh, thanks. Got a ...          
00:00:06,701 --> 00:00:08,651                3                       How does it feel knowing...     
00:00:08,653 --> 00:00:10,786                4                       is to go out...        

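My goal is to merge subtitles in different languages into a single file, based on the timecode.

What is the best approach: Compare-Object, hashtables, or PSObject?

Thanks for your help.

2 Answers:

Answer 0: (score: 0)

You will have more work to do, but this should be enough for the problem at hand. Group-Object is the approach that comes to mind.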

function Convert-SubtitlesToObject{
    param(
        [parameter(Mandatory=$true)]
        [ValidateScript({Test-Path $_})]
        [String]
        $Path
    )
    $regex = [regex]'(?m)(?<sequence>\d+)\r\n(?<timecode>\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3})\r\n(?<text>[\s\S]*?\r\n\r\n)'
    $text = Get-Content -Path $Path -Raw
    # Read the named groups from each Match directly; this avoids
    # overwriting the automatic $Matches variable with a second -match
    $regex.Matches($text) | ForEach-Object {
        [PSCustomObject][ordered]@{
            sequence = $_.Groups['sequence'].Value
            timecode = $_.Groups['timecode'].Value
            text     = $_.Groups['text'].Value
        }
    }
}

$englishSubs = Convert-SubtitlesToObject -Path 'C:\temp\put\The Big Bang Theory - 8x06 - The Expedition Approximation.HDTV.LOL.HI.en.srt'
$frenchSubs = Convert-SubtitlesToObject -Path 'C:\temp\put\The Big Bang Theory - 8x06 - The Expedition Approximation.HDTV.LOL.fr.srt'

$collection = @()
$collection += $englishSubs
$collection += $frenchSubs 

$sequence = 0
$collection | Group-Object timecode | Select-Object Name,@{l="Text";e={$_.Group.Text}} | ForEach-Object{
    $sequence++
    Write-Output "$sequence`r`n$($_.Name)`r`n$($_.Text)"
}

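I turned your code into a function, since you will call it for every file. Run it against the English and French subs and put the results into the larger $collection. Call Group-Object on that collection to group the entries by timecode. Take that data and expand the text into a single field. Finally, the output is assembled to mimic a subtitle file as closely as possible. You will need to watch out for timecodes that do not match, but I will leave it up to you what to do in that case.

Here is some sample output, which you can pipe to Out-File or Add-Content: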

1
00:00:00,000 --> 00:00:01,800
English Subtitles (HI) 
[MP4] The Big Bang Theory S08E06 (720p) The Expedition Approximation HDTV [KoTuWa]


2
00:00:02,880 --> 00:00:04,146
I like your suit.

 J'aime ton tailleur.


3
00:00:04,148 --> 00:00:06,699
Oh, thanks. Got a couple
new outfits for work.

 Merci.
J'en ai acheté pour le boulot.
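
The caveat about timecodes that do not match can be checked before writing the merged file. The following is only a sketch, not part of the original answer: the sample objects are hypothetical stand-ins for the output of Convert-SubtitlesToObject, and it assumes exactly two languages, so any group whose Count is not 2 has no counterpart in the other file.

```powershell
# Hypothetical sample data standing in for the parsed subtitle objects
$englishSubs = @(
    [PSCustomObject]@{ timecode = '00:00:02,880 --> 00:00:04,146'; text = 'I like your suit.' }
    [PSCustomObject]@{ timecode = '00:00:04,148 --> 00:00:06,699'; text = 'Oh, thanks.' }
)
$frenchSubs = @(
    [PSCustomObject]@{ timecode = '00:00:02,880 --> 00:00:04,146'; text = "J'aime ton tailleur." }
    [PSCustomObject]@{ timecode = '00:00:04,150 --> 00:00:06,700'; text = 'Merci.' }
)

# Group both languages by the full timecode string, as in the answer above
$groups = @($englishSubs + $frenchSubs | Group-Object -Property timecode)

# With two input files, a group that does not hold exactly two entries
# has no exact timecode match in the other language
$unmatched = @($groups | Where-Object { $_.Count -ne 2 })

$unmatched | ForEach-Object { "Unmatched timecode: $($_.Name)" }
```

In this sample, the second English and French entries differ by a few milliseconds, so each ends up alone in its own group and is reported as unmatched.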

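Disclaimer

I know some PowerShell. I know jack about the subtitle file format.

Answer 1: (score: 0)

Sorry for my late reply. I tried to find a solution myself. It is incomplete, especially when the timecodes are not identical. Yours is better.

Here is my solution.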

function Convert-SubtitlesToObject{
    param(
        [parameter(Mandatory=$true)]
        [ValidateScript({Test-Path $_})]
        [String]
        $Path
    )
    $regex = [regex]'(?m)(?<sequence>\d+)\r\n(?<timecode>(?<t1>\d{2}:\d{2}:\d{2},\d{3}) --> \d{2}:\d{2}:\d{2},\d{3})\r\n(?<text>[\s\S]*?\r\n\r\n)'
    $text = Get-Content -Path $Path -Raw
    # Read the named groups from each Match directly; this avoids
    # overwriting the automatic $Matches variable with a second -match
    $regex.Matches($text) | ForEach-Object {
        [PSCustomObject][ordered]@{
            sequence = $_.Groups['sequence'].Value
            timecode = $_.Groups['timecode'].Value
            text     = $_.Groups['text'].Value
        }
    }
}

$englishSubs = Convert-SubtitlesToObject -Path 'G:\test_powershell_subtitle\The Big Bang Theory - 08x06 - english.srt'
$frenchSubs = Convert-SubtitlesToObject -Path 'G:\test_powershell_subtitle\The Big Bang Theory - 08x06 - french.srt'

$temp = Compare-Object $frenchSubs $englishSubs -property sequence,timecode,text

$subtitles = $temp | Group-Object -Property timecode | ForEach-Object {
    [PSCustomObject][ordered]@{
        seq    = $_.Group[1].sequence
        time   = $_.Name
        string = $_.Group[0].text + $_.Group[1].text
    }
}

# Emit each merged entry in SRT layout: sequence number, timecode, then the combined text
foreach ($subtitle in $subtitles) {
    Write-Output "$($subtitle.seq)`r`n$($subtitle.time)`r`n$($subtitle.string)`r`n"
}
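
For the case mentioned at the top of this answer, where the timecodes are not identical, one possible direction is to compare start times with a tolerance instead of comparing the timecode strings. This is only a sketch under assumptions: Get-StartTime is a hypothetical helper, the 500 ms tolerance is an arbitrary value chosen for illustration, and the sample objects stand in for the output of Convert-SubtitlesToObject.

```powershell
# Hypothetical helper: convert the start of an SRT timecode
# ("hh:mm:ss,fff --> hh:mm:ss,fff") to a TimeSpan
function Get-StartTime {
    param([string]$Timecode)
    $start = ($Timecode -split ' --> ')[0]
    # SRT uses a comma before the milliseconds; TimeSpan.Parse expects a dot
    [timespan]::Parse(($start -replace ',', '.'), [cultureinfo]::InvariantCulture)
}

# Hypothetical sample data; the French start times are slightly off
$englishSubs = @(
    [PSCustomObject]@{ timecode = '00:00:02,880 --> 00:00:04,146'; text = 'I like your suit.' }
    [PSCustomObject]@{ timecode = '00:00:04,148 --> 00:00:06,699'; text = 'Oh, thanks.' }
    [PSCustomObject]@{ timecode = '00:00:10,000 --> 00:00:11,000'; text = 'Solo line.' }
)
$frenchSubs = @(
    [PSCustomObject]@{ timecode = '00:00:02,900 --> 00:00:04,150'; text = "J'aime ton tailleur." }
    [PSCustomObject]@{ timecode = '00:00:04,150 --> 00:00:06,700'; text = 'Merci.' }
)

$toleranceMs = 500   # assumed threshold, tune for the files at hand

$merged = foreach ($en in $englishSubs) {
    $enStart = Get-StartTime $en.timecode
    # First French entry whose start time is within the tolerance, if any
    $fr = $frenchSubs |
        Where-Object { [math]::Abs(((Get-StartTime $_.timecode) - $enStart).TotalMilliseconds) -le $toleranceMs } |
        Select-Object -First 1
    [PSCustomObject]@{
        timecode = $en.timecode
        # Keep the English text alone when no French entry is close enough
        text     = (@($en.text, $fr.text) | Where-Object { $_ }) -join "`r`n"
    }
}

$merged | ForEach-Object { "$($_.timecode)`r`n$($_.text)`r`n" }
```

The nested search is quadratic in the number of entries, which should be acceptable for a single episode. French entries that match no English entry are dropped in this sketch and would need a second pass.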