I am currently trying to filter a lexicon into a file of nonsilence phones, that is, a file listing every phoneme used in the lexicon. Each phoneme must appear only once in the file.
Snippet of lexicon_a:
<oov> <oov>
A AH0
A EY1
A''S EY1 Z
A'BODY EY1 B AA2 D IY0
A'COURT EY1 K AO2 R T
A'D EY1 D
A'GHA EY1 G AH0
A'GOIN EY1 G OY1 N
A'LL EY1 L
A'M EY1 M
A'MIGHTY EY1 M AY1 T IY0
A'MIGHTY'S EY1 M AY1 T IY0 Z
A'MOST EY1 M OW2 S T
A'N'T EY1 AH0 N T
A'PENNY EY1 P EH2 N IY0
A'READY EY1 R IY1 D IY0
A'RIGHT EY1 R AY2 T
A'RONY EY1 R OW1 N IY0
A'S EY1 Z
A'TER EY1 T ER0
A'TERNOON EY1 T ER0 N UW1 N
A'TERWARDS EY1 T ER0 W ER0 D Z
A'THEGITHER EY1 DH AH0 JH IH1 DH ER0
A'THING EY1 DH IH0 NG
A'TIM EY1 T IH2 M
A'VE AH0 V
AA AA1
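I tried the following command, but it seems to mess up the output, which ends up as a mix of words and phonemes. How do I extract and display only the phonemes?
cut -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt
Messed-up output: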
<oov>
A
A'S
AA1
AA2
AH0
AO2
AY1
AY2
B
D
DH
EH2
ER0
EY1
G
IH0
IH1
IH2
IY0
IY1
JH
K
L
M
N
NG
OW1
OW2
OY1
P
R
S
T
UW1
V
W
Z
I tried cut -d' ' -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt on another lexicon, lexicon_b.txt, whose entries are laid out as
word '\t' phonemes
which generated the correct output. Its entries are listed like this:
<oov> <oov>
A AH
AND AH N D
APOSTROPHE AH P AA S T R AH F IY
APRIL EY P R AH L
AREA EH R IY AH
AUGUST AA G AH S T
B B IY
C S IY
CODE K OW D
D D IY
DECEMBER D IH S EH M B ER
E IY
EIGHT EY T
EIGHTEEN EY T IY N
EIGHTEENTH EY T IY N TH
EIGHT EY T TH
EIGHTY EY T IY
ELEVEN IH L EH V AH N
ELEVENTH IH L EH V AH N TH
ENTER EH N T ER
ERASE IH R EY S
F EH F
FEBRUARY F EH B Y AH W EH R IY
FIFTEEN F IH F T IY N
FIFTEENTH F IH F T IY N TH
FIFTH F IH F TH
FIFTY F IH F T IY
FIRST F ER S T
FIVE F AY V
FORTY F AO R T IY
FOUR F AO R
FOURTEEN F AO R T IY N
FOURTH F AO R TH
G JH IY
GO G OW
H EY CH
HALF HH AE F
HELP HH EH L P
HUNDRED HH AH N D R AH D
I AY
J JH EY
JANUARY JH AE N Y UW EH R IY
JULY JH UW L AY
JUNE JH UW N
K K EY
L EH L
M EH M
MARCH M AA R CH
MAY M EY
N EH N
NINE N AY N
NINETEEN N AY N T IY N
NINETY N AY N T IY
NINTH N AY N TH
NO N OW
NOVEMBER N OW V EH M B ER
O OW
OCTOBER AA K T OW B ER
OF AH V
OH OW
ONE W AH N
P P IY
Q K Y UW
R AA R
REPEAT R IH P IY T
RUBOUT R AH B AW T
S EH S
SECOND S EH K AH N D
SEPTEMBER S EH P T EH M B ER
SEVEN S EH V AH N
SEVENTEEN S EH V AH N T IY N
SEVENTH S EH V AH N TH
SEVENTY S EH V AH N T IY
SIX S IH K S
SIXTEEN S IH K S T IY N
SIXTEENTH S IH K S T IY N TH
SIXTH S IH K S TH
SIXTY S IH K S T IY
START S T AA R T
STOP S T AA P
T T IY
TEN T EH N
THIRD TH ER D
THIRTEEN TH ER T IY N
THIRTIETH TH ER T IY AH TH
THIRTY TH ER D IY
THOUSAND TH AW Z AH N D
THREE TH R IY
TWELFTH T W EH L F TH
TWELVE T W EH L V
TWENTIETH T W EH N T IY AH TH
TWENTY T W EH N T IY
TWO T UW
U Y UW
V V IY
W D AH B AH L Y UW
X EH K S
Y W AY
YES Y EH S
Z Z IY
ZERO Z IH R OW
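The only difference between lexicon_a and lexicon_b is that the word and its phonemes are tab-separated in lexicon_b and space-separated in lexicon_a.
That is why I thought changing the delimiter in cut to tab would suffice.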
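One possible explanation, offered as an assumption because the full files are not shown, is the way cut handles lines that lack the chosen delimiter: such lines are passed through whole unless -s is given. The sketch below uses a made-up two-line sample (one space-separated line, one tab-separated line) to show both cases.

```shell
# Hypothetical sample: line 1 is space-separated, line 2 has a tab after the word.
mixed=$(printf 'A AH0\nAND\tAH N D')

# With -d' ', the tab-separated line is split at its first space,
# so the first phoneme (AH) is cut away together with the word.
by_space=$(printf '%s\n' "$mixed" | cut -d' ' -f2-)

# With the default tab delimiter, the space-only line contains no delimiter
# and is printed whole, so the word "A" leaks into the result.
by_tab=$(printf '%s\n' "$mixed" | cut -f2-)

printf 'space delimiter:\n%s\n' "$by_space"
printf 'tab delimiter:\n%s\n' "$by_tab"
```

Either delimiter therefore gives clean results only when every line of the file uses that same separator between the word and its phonemes.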
Answer 0 (score: 0)
If you just want every string in lexicon.txt except the first-column value, and then the unique strings among them, try:
cut -d' ' -f2- lexicon.txt | sed 's/^ *//g' | tr ' ' '\n' | sort -u
That is:
Remove the first column:
cut -d' ' -f2-
Strip any leading spaces left at the start of each line:
sed 's/^ *//g'
Change spaces to newlines so that each string ends up on its own line in a single column:
tr ' ' '\n'
Sort the output and keep only unique lines:
sort -u
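The pipeline in this answer can be tried on a few entries before running it on the whole lexicon. The following is a minimal sketch, assuming a temporary file holding three space-separated lines copied from the lexicon_a snippet; the variable names sample and phones are illustrative only, and LC_ALL=C is added so the sort order does not depend on the locale.

```shell
# Hypothetical sample file with three space-separated entries from lexicon_a.
sample=$(mktemp)
printf '%s\n' "A AH0" "A'S EY1 Z" "A'D EY1 D" > "$sample"

# Drop the word column, strip leading spaces, put one token per line,
# then sort and remove duplicates.
phones=$(cut -d' ' -f2- "$sample" | sed 's/^ *//' | tr ' ' '\n' | LC_ALL=C sort -u)

printf '%s\n' "$phones"
rm -f "$sample"
```

On this sample the duplicate EY1 collapses to a single line, leaving AH0, D, EY1 and Z.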
Answer 1 (score: 0)
Use awk to print each string on its own line, then sort | uniq to remove the duplicates.
$ awk '{for(i=2;i<=NF;i++)print $i}' file | sort | uniq
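Because awk's default field splitting treats any run of spaces or tabs as a single separator, the same command should cope with both the space-separated and the tab-separated layout. Here is a small sketch of that, assuming a made-up two-line sample; extract_phones is a hypothetical helper name, and sort -u is used in place of sort | uniq with LC_ALL=C for a locale-independent order.

```shell
# extract_phones is a hypothetical helper, not part of any toolkit.
extract_phones() {
    # Fields 2..NF are the phonemes; field 1 (the word) is skipped.
    awk '{ for (i = 2; i <= NF; i++) print $i }' "$1" | LC_ALL=C sort -u
}

sample=$(mktemp)
# One space-separated line (lexicon_a style) and one tab-separated line (lexicon_b style).
printf 'A AH0\nAND\tAH N D\n' > "$sample"

phones=$(extract_phones "$sample")
printf '%s\n' "$phones"
rm -f "$sample"
```

Neither word (A or AND) appears in the result, regardless of which separator follows it.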
Answer 2 (score: 0)
This might work for you (GNU sed & sort):
sed 's/^\S\S*\s*//;s/\s\s*/\n/g' file | sort -u
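Remove the first field and the whitespace that follows it, then replace each remaining run of one or more whitespace characters with a newline. Sort and remove duplicates.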