Does not generate the expected output file

Time: 2017-04-19 09:04:49

Tags: bash sorting sed cut

I am currently trying to filter a lexicon that is structured like this:

word '\t' phonemes

Snippet of lexicon_a:

<oov> <oov>
A  AH0
A  EY1
A''S    EY1 Z
A'BODY  EY1 B AA2 D IY0
A'COURT EY1 K AO2 R T
A'D EY1 D
A'GHA   EY1 G AH0
A'GOIN  EY1 G OY1 N
A'LL    EY1 L
A'M EY1 M
A'MIGHTY    EY1 M AY1 T IY0
A'MIGHTY'S  EY1 M AY1 T IY0 Z
A'MOST  EY1 M OW2 S T
A'N'T   EY1 AH0 N T
A'PENNY EY1 P EH2 N IY0
A'READY EY1 R IY1 D IY0
A'RIGHT EY1 R AY2 T
A'RONY  EY1 R OW1 N IY0
A'S  EY1 Z
A'TER   EY1 T ER0
A'TERNOON   EY1 T ER0 N UW1 N
A'TERWARDS  EY1 T ER0 W ER0 D Z
A'THEGITHER EY1 DH AH0 JH IH1 DH ER0
A'THING EY1 DH IH0 NG
A'TIM   EY1 T IH2 M
A'VE    AH0 V
AA  AA1

The goal is a file of non-silence phones, i.e. a file that lists all the phonemes, with each phoneme occurring only once.

I tried something like this:

cut -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt

But that seems to mess up the output: I get a mix of words and phonemes. How do I extract and list only the phonemes? The messed-up output:

<oov>
A
A'S
AA1
AA2
AH0
AO2
AY1
AY2
B
D
DH
EH2
ER0
EY1
G
IH0
IH1
IH2
IY0
IY1
JH
K
L
M
N
NG
OW1
OW2
OY1
P
R
S
T
UW1
V
W
Z

I also tried:

cut -d' ' -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt

For another file, lexicon_b.txt, which uses the same word '\t' phonemes structure, the correct output was generated. Its entries are listed like so:

<oov> <oov>
A AH
AND AH N D
APOSTROPHE AH P AA S T R AH F IY
APRIL EY P R AH L
AREA EH R IY AH
AUGUST AA G AH S T
B B IY
C S IY
CODE K OW D
D D IY
DECEMBER D IH S EH M B ER
E IY
EIGHT EY T
EIGHTEEN EY T IY N
EIGHTEENTH EY T IY N TH
EIGHT EY T TH
EIGHTY EY T IY
ELEVEN IH L EH V AH N
ELEVENTH IH L EH V AH N TH
ENTER EH N T ER
ERASE IH R EY S
F EH F
FEBRUARY F EH B Y AH W EH R IY
FIFTEEN F IH F T IY N
FIFTEENTH F IH F T IY N TH
FIFTH F IH F TH
FIFTY F IH F T IY
FIRST F ER S T
FIVE F AY V
FORTY F AO R T IY
FOUR F AO R
FOURTEEN F AO R T IY N
FOURTH F AO R TH
G JH IY
GO G OW
H EY CH
HALF HH AE F
HELP HH EH L P
HUNDRED HH AH N D R AH D
I AY
J JH EY
JANUARY JH AE N Y UW EH R IY
JULY JH UW L AY
JUNE JH UW N
K K EY
L EH L
M EH M
MARCH M AA R CH
MAY M EY
N EH N
NINE N AY N
NINETEEN N AY N T IY N
NINETY N AY N T IY
NINTH N AY N TH
NO N OW
NOVEMBER N OW V EH M B ER
O OW
OCTOBER AA K T OW B ER
OF AH V
OH OW
ONE W AH N
P P IY
Q K Y UW
R AA R
REPEAT R IH P IY T
RUBOUT R AH B AW T
S EH S
SECOND S EH K AH N D
SEPTEMBER S EH P T EH M B ER
SEVEN S EH V AH N
SEVENTEEN S EH V AH N T IY N
SEVENTH S EH V AH N TH
SEVENTY S EH V AH N T IY
SIX S IH K S
SIXTEEN S IH K S T IY N
SIXTEENTH S IH K S T IY N TH
SIXTH S IH K S TH
SIXTY S IH K S T IY
START S T AA R T
STOP S T AA P
T T IY
TEN T EH N
THIRD TH ER D
THIRTEEN TH ER T IY N
THIRTIETH TH ER T IY AH TH
THIRTY TH ER D IY
THOUSAND TH AW Z AH N D
THREE TH R IY
TWELFTH T W EH L F TH
TWELVE T W EH L V
TWENTIETH T W EH N T IY AH TH
TWENTY T W EH N T IY
TWO T UW
U Y UW
V V IY
W D AH B AH L Y UW
X EH K S
Y W AY
YES Y EH S
Z Z IY
ZERO Z IH R OW

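The only difference between lexicon_a and lexicon_b is that the word and its phonemes are separated by a tab in lexicon_b and by spaces in lexicon_a.

That is why I thought changing the delimiter in cut (from the default tab to a space) would be sufficient.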

3 Answers:

Answer 0: (score: 0)

If you just want every string in lexicon.txt except the first column's value, reduced to the unique strings, try:

cut -d' ' -f2- lexicon.txt | sed 's/^ *//g' | tr ' ' '\n' | sort -u

That is:

Remove the first column:

cut -d' ' -f2-

Remove the leading spaces left at the start of each line:

sed 's/^ *//g'

Change spaces to newlines, so that the strings end up in a single column:

tr ' ' '\n'

Sort the output and keep only unique lines:

sort -u
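As a quick check, here is a minimal sketch that runs the same four stages on a tiny sample. The temporary file and its three entries are hypothetical stand-ins for lexicon.txt, and `LC_ALL=C` is added only to make the sort order reproducible.

```shell
# Hypothetical sample: word and phonemes separated by two spaces.
sample=$(mktemp)
printf '%s\n' 'A  AH0' 'A  EY1' "A'S  EY1 Z" > "$sample"

# Drop column 1, strip leading spaces, one token per line, unique sort.
phones=$(cut -d' ' -f2- "$sample" | sed 's/^ *//g' | tr ' ' '\n' | LC_ALL=C sort -u)
printf '%s\n' "$phones"

rm -f "$sample"
```

Note that this relies on the word being followed by a space. On a line where the word is followed by a tab, `cut -d' '` would split at the first space inside the phoneme list instead, so the first phoneme would be dropped together with the word.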

Answer 1: (score: 0)

Use awk to print each string after the first field on its own line, then sort | uniq to remove the duplicates.

$ awk '{for(i=2;i<=NF;i++)print $i}' file | sort | uniq
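One property that may matter here: with its default field separator, awk treats any run of spaces or tabs as a single separator, so this approach should not care which of the two a given line uses. A small sketch, assuming a hypothetical sample file that mixes both (`sort -u` is used as shorthand for `sort | uniq`):

```shell
# Hypothetical sample: line 1 uses a tab, line 2 two spaces, line 3 one space.
sample=$(mktemp)
printf "A'D\tEY1 D\nA  AH0\n<oov> <oov>\n" > "$sample"

# Print every field except the first, one per line, then deduplicate.
awk_phones=$(awk '{for(i=2;i<=NF;i++)print $i}' "$sample" | LC_ALL=C sort -u)
printf '%s\n' "$awk_phones"

rm -f "$sample"
```

The `<oov>` token survives because it also appears in the second column; filter it out afterwards if it should not count as a phone.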

Answer 2: (score: 0)

This might work for you (GNU sed & sort):

sed 's/^\S\S*\s*//;s/\s\s*/\n/g' file | sort -u 

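Remove the first field and the whitespace that follows it, then replace each remaining run of one or more whitespace characters with a newline. Sort and remove duplicates.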
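The `\S` and `\s` escapes, and `\n` in the replacement, are not guaranteed to work outside GNU sed. As a rough alternative, the sketch below uses bracket classes in sed and lets tr do the splitting. The sample file is hypothetical, and the `tr -s '[:space:]' '\n'` idiom is assumed to behave as it does in the GNU and BSD implementations.

```shell
# Hypothetical sample: one tab-separated and one space-separated entry.
sample=$(mktemp)
printf "A'D\tEY1 D\nA  AH0\n" > "$sample"

# Strip the first field plus following whitespace, then turn every
# remaining whitespace run into a single newline and deduplicate.
sed_phones=$(sed 's/^[^[:space:]]*[[:space:]]*//' "$sample" \
  | tr -s '[:space:]' '\n' | LC_ALL=C sort -u)
printf '%s\n' "$sed_phones"

rm -f "$sample"
```

Squeezing with `tr -s` also keeps trailing whitespace from producing empty lines in the result.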