Extracting data from a .seg file

Time: 2017-08-21 09:04:58

Tags: linux bash shell speech-recognition

I have a .seg file that stores the cluster data produced by diarizing an audio file. The file contains the following data:

;; cluster S0 [ score:FS = -32.694324625945725 ] [ score:FT = 
-33.32942628147711 ] [ score:MS = -32.847416329096404 ] [ score:MT = 
-33.45196981196905 ] 
ElonN 1 0 758 F S U S0
;; cluster S1 [ score:FS = -33.14490351155562 ] [ score:FT = 
-33.420111126893076 ] [ score:MS = -32.29039025858266 ] [ score:MT = 
-32.85038927851203 ] 
ElonN 1 758 308 M S U S1
ElonN 1 1110 700 M S U S1
ElonN 1 1887 2794 M S U S1
ElonN 1 4849 1190 M S U S1
;; cluster S10 [ score:FS = -34.466969784129404 ] [ score:FT = 
-34.951981832991414 ] [ score:MS = -34.83408030011385 ] [ score:MT = 
-35.17326803680231 ] 
ElonN 1 6731 352 F S U S10
;; cluster S11 [ score:FS = -33.57333115273301 ] [ score:FT = 
-33.93961876513661 ] [ score:MS = -32.6529742867516 ] [ score:MT = 
-33.397218081762475 ] 
ElonN 1 7459 2542 M S U S11
;; cluster S16 [ score:FS = -33.29482735979043 ] [ score:FT = 
-33.687616298740195 ] [ score:MS = -32.189984103971135 ] [ score:MT = 
-33.13899965310298 ] 
ElonN 1 10001 3051 M S U S16
ElonN 1 13086 912 M S U S16
;; cluster S9 [ score:FS = -33.4457701986847 ] [ score:FT = 
-34.70059869569136 ] [ score:MS = -33.958162156208914 ] [ score:MT = 
-34.79598011488008 ] 
ElonN 1 6039 692 F S U S9

I need to extract the start time (column 3), the duration (column 4), and the last column (the speaker name).

In the following segment

ElonN 1 6039 692 F S U S9

6039 is the start time of the segment, 692 is the duration of the segment, and S9 is the speaker name.

The shell script I wrote below extracts the complete segment lines and stores them in a file.

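echo "Enter audio file name. (File must be of .wav format)"
read -r fileName

echo "Enter path of the audio file"
read -r path

echo "Enter folder name"
read -r outputfolder

# Quote the variables so names containing spaces do not break the commands
mkdir -p "$outputfolder"

echo "Processing $fileName"
./ilp_diarization2.sh "$path/$fileName.wav" 120 "$outputfolder"

# Keep the lines that contain the file name followed by "S" and store them in file a
grep "$fileName.*S" "$outputfolder/$fileName/$fileName.g.3.seg" > a

cat a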

1 answer:

Answer 0: (score: 2)

You can use awk, either capturing the result in a variable or redirecting it to a file:

var=$(awk '{ print $3" "$4" "$NF }' filename)

awk '{ print $3" "$4" "$NF }' filename > outputfile

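$number refers to the whitespace-separated field (whitespace is awk's default separator) that you are interested in, and $NF is the last field on the line.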
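If awk is run directly on the .seg file rather than on the grep output, the ";; cluster" header lines would be printed as well. A minimal sketch that skips them, assuming every segment line has exactly 8 fields as in the sample above (sample.seg and segments.txt are hypothetical file names used only for illustration):

```shell
# Recreate a few lines of the .seg file shown in the question,
# including a cluster header wrapped over three lines.
cat > sample.seg <<'EOF'
;; cluster S0 [ score:FS = -32.694324625945725 ] [ score:FT = 
-33.32942628147711 ] [ score:MS = -32.847416329096404 ] [ score:MT = 
-33.45196981196905 ] 
ElonN 1 0 758 F S U S0
ElonN 1 758 308 M S U S1
ElonN 1 1110 700 M S U S1
EOF

# Drop lines starting with ";;" and any line that does not have 8 fields
# (this also drops the wrapped header continuations), then print
# start time, duration and speaker name.
awk '!/^;;/ && NF == 8 { print $3, $4, $NF }' sample.seg > segments.txt

cat segments.txt
```

With the sample lines above, segments.txt should hold one "start duration speaker" triple per segment, for example "0 758 S0" for the first one.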