Python's NLTK ships with cmudict, which returns the phonemes of words it recognizes, e.g. 'see' -> [u'S', u'IY1']. For a word it does not recognize it raises an error, e.g. 'seasee' -> error.
# Python 2 (print statement)
import nltk

arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea'):
    try:
        print arpabet[word][0]
    except Exception as e:
        print e
#Output
[u'EH1', u'S']
[u'S', u'IY1']
[u'S', u'IY1']
[u'K', u'AH0', u'M', u'P', u'Y', u'UW1', u'T']
'comput'
'seesea'
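Is there a module that doesn't have this limitation and can find or guess the phonemes of any real or made-up word?
If not, is there a way to program it myself? I was thinking of looping over growing prefixes of the word. For 'seasee', the first iteration would take 's', the next 'se', the third 'sea', and so on, looking each one up in cmudict. The problem is that I don't know how to tell which prefix gives the right phonemes: both 's' and 'sea' in 'seasee' return valid phonemes.
Work in progress: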
# Python 2 (print statement)
import nltk

arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea', 'darfasasawwa'):
    try:
        phone = arpabet[word][0]
    except:
        try:
            counter = 0
            for i in word:
                substring = word[0:1+counter]
                counter += 1
                try:
                    print substring, arpabet[substring][0]
                except Exception as e:
                    print e
        except Exception as e:
            print e
#Output
c [u'S', u'IY1']
co [u'K', u'OW1']
com [u'K', u'AA1', u'M']
comp [u'K', u'AA1', u'M', u'P']
compu [u'K', u'AA1', u'M', u'P', u'Y', u'UW0']
comput 'comput'
s [u'EH1', u'S']
se [u'S', u'AW2', u'TH', u'IY1', u'S', u'T']
see [u'S', u'IY1']
sees [u'S', u'IY1', u'Z']
seese [u'S', u'IY1', u'Z']
seesea 'seesea'
d [u'D', u'IY1']
da [u'D', u'AA1']
dar [u'D', u'AA1', u'R']
darf 'darf'
darfa 'darfa'
darfas 'darfas'
darfasa 'darfasa'
darfasas 'darfasas'
darfasasa 'darfasasa'
darfasasaw 'darfasasaw'
darfasasaww 'darfasasaww'
darfasasawwa 'darfasasawwa'
Answer 0 (score: 3)
I ran into the same problem and solved it by recursively partitioning the unknown word (see wordbreak):
import nltk
from functools import lru_cache
from itertools import product as iterprod

try:
    arpabet = nltk.corpus.cmudict.dict()
except LookupError:
    # Download the corpus on first use
    nltk.download('cmudict')
    arpabet = nltk.corpus.cmudict.dict()

@lru_cache()
def wordbreak(s):
    s = s.lower()
    if s in arpabet:
        return arpabet[s]
    middle = len(s)/2
    # Try split points near the middle first, with a bias toward longer prefixes
    partition = sorted(list(range(len(s))), key=lambda x: (x-middle)**2-x)
    for i in partition:
        pre, suf = (s[:i], s[i:])
        if pre in arpabet and wordbreak(suf) is not None:
            # Combine every pronunciation of the prefix with every one of the suffix
            return [x+y for x,y in iterprod(arpabet[pre], wordbreak(suf))]
    return None
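To see how the recursion behaves without downloading cmudict, here is a minimal sketch of the same idea run against a tiny stand-in dictionary. TOY_LEXICON and toy_wordbreak are names made up for this illustration; they are not part of NLTK.

```python
from functools import lru_cache
from itertools import product

# Hypothetical toy lexicon standing in for cmudict:
# word -> list of pronunciations, each a list of ARPAbet symbols.
TOY_LEXICON = {
    "s": [["EH1", "S"]],
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
}

@lru_cache(maxsize=None)
def toy_wordbreak(s):
    """Return pronunciations built by splitting s into known words, or None."""
    s = s.lower()
    if s in TOY_LEXICON:
        return TOY_LEXICON[s]
    middle = len(s) / 2
    # Try split points near the middle first, with a bias toward longer prefixes.
    split_points = sorted(range(len(s)), key=lambda i: (i - middle) ** 2 - i)
    for i in split_points:
        prefix, suffix = s[:i], s[i:]
        if prefix in TOY_LEXICON:
            rest = toy_wordbreak(suffix)
            if rest is not None:
                # Pair every prefix pronunciation with every suffix pronunciation.
                return [p + r for p, r in product(TOY_LEXICON[prefix], rest)]
    return None

print(toy_wordbreak("seesea"))  # [['S', 'IY1', 'S', 'IY1']]
print(toy_wordbreak("sees"))    # [['S', 'IY1', 'EH1', 'S']]
print(toy_wordbreak("xyz"))     # None
```

The 'sees' case shows a limitation of this approach: when the only way to cover a leftover letter is its dictionary entry, you get the spelled-out letter name ('EH1 S') rather than the sound it would make inside a word.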
Answer 1 (score: 2)
You can use the LOGIOS Lexicon Tool. Here is its output for your examples:
S         EH S
SEE       S IY
SEA       S IY
COMPUTE   K AH M P Y UW T
COMPUT    K AH M P UH T
SEESEA    S IY S IY
I don't know of any Python implementation. You can try implementing it yourself, or call the tool with subprocess.call.
Answer 2 (score: 1)
Try the pronouncing module:
https://pronouncing.readthedocs.io/en/latest/
Example:
pronouncing.phones_for_word("word")
I hope this works :)
Answer 3 (score: 0)
You can use the g2p library.
Installation:
pip install g2p_en
OR
python setup.py install
Answer 4 (score: 0)
I have just built on 邓诺's answer. With the following code you get exactly the same results as the LOGIOS Lexicon Tool, which is based on CMUdict.
import re
import pronouncing

text = "april is the cruelest month breeding lilacs out of the dead"
words = text.split()
WordToPhn = []
for word in words:
    pronunciation_list = pronouncing.phones_for_word(word)[0]  # take the first pronunciation of the word
    WordToPhn.append(pronunciation_list)
SentencePhn = '  '.join(WordToPhn)
Output = re.sub(r'\d+', '', SentencePhn)  # remove the stress digits from the phonemes
# SentencePhn: EY1 P R AH0 L  IH1 Z  DH AH0  K R UW1 L AH0 S T  M AH1 N TH  B R IY1 D IH0 NG  L AY1 L AE2 K S  AW1 T  AH1 V  DH AH0  D EH1 D
# Output: EY P R AH L  IH Z  DH AH  K R UW L AH S T  M AH N TH  B R IY D IH NG  L AY L AE K S  AW T  AH V  DH AH  D EH D
I put two spaces between the phonemes of consecutive words. If you want a single space, as in the LOGIOS Lexicon Tool output, change it in this line:
SentencePhn = '  '.join(WordToPhn)
Hope that helps!
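One caveat: as far as I know, pronouncing.phones_for_word returns an empty list for a word that is not in CMUdict, so the [0] above raises IndexError on made-up words like 'seesea'. Below is a rough sketch of a variant that collects unknown words instead of crashing. phonemes_for_text and the TOY dictionary are hypothetical names for this illustration, and the lookup callable stands in for pronouncing.phones_for_word so the snippet runs without any third-party package.

```python
import re

def phonemes_for_text(text, lookup, word_sep=" "):
    """Join the first pronunciation of each word, with stress digits removed.

    `lookup` is any callable mapping a word to a list of ARPAbet strings.
    Words for which it returns an empty list are reported separately.
    """
    found, unknown = [], []
    for word in text.lower().split():
        candidates = lookup(word)
        if candidates:
            # Keep the first pronunciation and strip stress markers (0, 1, 2).
            found.append(re.sub(r"\d+", "", candidates[0]))
        else:
            unknown.append(word)
    return word_sep.join(found), unknown

# Hypothetical stand-in lexicon; a real lookup would be passed in its place.
TOY = {"the": ["DH AH0", "DH AH1", "DH IY0"], "dead": ["D EH1 D"]}

result, missing = phonemes_for_text("the dead seesea", lambda w: TOY.get(w, []))
print(result)   # DH AH D EH D
print(missing)  # ['seesea']
```

The words that end up in the second return value could then be handed to a fallback such as the recursive wordbreak approach from Answer 0.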