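I need to find a fairly efficient way to detect syllables in a word. For example,
Invisible -> in-vi-sib-le
There are some syllabification rules that could be used:
V CV VC CVC CCV CCCV CVCC
*where V is a vowel and C is a consonant. For example,
Pronunciation (5 syllables: Pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
I have tried a few methods, among them regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite state automaton (which did not produce anything useful).
The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell-checking applications (using Bayesian classifiers) and text-to-speech synthesis.
I would appreciate tips on an alternative way to solve this problem besides my previous approaches.
I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.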
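Answer 0 (score: 111)
Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis dissertation Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate and includes a small exception dictionary for the cases where the algorithm does not work.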
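As a rough illustration of the pattern idea behind this approach, here is a minimal Python sketch. The function names and the tiny pattern list are assumptions made for this example; the patterns are written in the TeX pattern notation, where a digit between letters scores a possible break point and an odd final score allows a break. A real hyphenator loads thousands of patterns plus the exception dictionary, so treat this as a toy, not as the thesis algorithm in full.

```python
import re

# Tiny hand-picked pattern set in TeX notation (an assumption for this demo).
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter scores."""
    letters = re.sub(r"\d", "", pattern)
    scores = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            scores[pos] = int(ch)  # score for the gap before letter number `pos`
        else:
            pos += 1
    return letters, scores

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the pieces of `word` split at every gap whose best score is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    points = [0] * (len(padded) + 1)       # points[i] = score of the gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            scores = table.get(padded[start:end])
            if scores:
                for offset, score in enumerate(scores):
                    points[start + offset] = max(points[start + offset], score)
    pieces, last = [], 0
    # Keep at least left_min letters before and right_min letters after any break.
    for k in range(left_min, len(word) - right_min + 1):
        if points[k + 1] % 2 == 1:         # gap before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces

print(hyphenate("hyphenation", TOY_PATTERNS))
```

With only these nine patterns the word splits as hy-phen-ation; hyphenation points are not always syllable boundaries, which is why the pattern set matters so much.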
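Answer 1 (score: 43)
I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: https://github.com/mnater/hyphenator
That is, unless you are the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)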
Answer 2 (score: 39)
Here is a solution using NLTK:
from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
    # One count per pronunciation: vowel phonemes end in a stress digit.
    return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
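To show why counting trailing digits works, here is a small self-contained sketch that does not need NLTK. The `SAMPLE` dictionary is a hand-written stand-in in the style of a CMUdict entry (the exact stress digits are an assumption), and `nsyl_safe` is a hypothetical helper that returns `None` instead of raising `KeyError` for words missing from the dictionary.

```python
# Hand-written stand-in for cmudict.dict(); stress digits are assumed.
SAMPLE = {
    "invisible": [["IH0", "N", "V", "IH1", "Z", "AH0", "B", "AH0", "L"]],
}

def count_syllables(pronunciation):
    """Count phonemes ending in a stress digit (0, 1 or 2), i.e. the vowels."""
    return sum(1 for phoneme in pronunciation if phoneme[-1].isdigit())

def nsyl_safe(word, pronunciations=SAMPLE):
    """Return one syllable count per listed pronunciation, or None if unknown."""
    entries = pronunciations.get(word.lower())
    if entries is None:
        return None
    return [count_syllables(p) for p in entries]

print(nsyl_safe("Invisible"))
print(nsyl_safe("qwertyish"))
```

A dictionary lookup only counts syllables; it does not say where the boundaries fall in the spelling.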
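Answer 3 (score: 18)
I am trying to tackle this problem for a program that will calculate the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I have found it gets in the ballpark for my purposes.
It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It is a gamble, but I decided to remove the "es" in my algorithm.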
private int CountSyllables(string word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    string currentWord = word;
    int numVowels = 0;
    bool lastWasVowel = false;
    foreach (char wc in currentWord)
    {
        bool foundVowel = false;
        foreach (char v in vowels)
        {
            //don't count diphthongs
            if (v == wc && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        //if full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    //remove es, it's _usually? silent
    if (currentWord.Length > 2 &&
        currentWord.Substring(currentWord.Length - 2) == "es")
        numVowels--;
    // remove silent e
    else if (currentWord.Length > 1 &&
        currentWord.Substring(currentWord.Length - 1) == "e")
        numVowels--;
    return numVowels;
}
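Answer 4 (score: 7)
This is a particularly difficult problem that the LaTeX hyphenation algorithm does not completely solve. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).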
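Answer 5 (score: 5)
Thanks, Joe Basirico, for sharing your quick-and-dirty implementation in C#. I have used the big libraries, and they work, but they are usually a bit slow, and for quick projects your method works fine.
Here is your code in Java, along with test cases: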
public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's usually silent.
    // Use endsWith(): == on Strings compares references, not contents,
    // so the original comparison never matched.
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}
public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}
The result was as expected (it works well enough for Flesch-Kincaid):
txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2
Answer 6 (score: 5)
Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have rewritten your code in Python:
def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1 #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel: #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e": #remove silent e
        numVowels-=1
    return numVowels
Hope someone finds this useful!
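Answer 7 (score: 4)
Perl has the Lingua::Phonology::Syllable module. You might try that, or look into its algorithm. I saw a few other older modules there, too.
I do not understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.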
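To make the regular-expression idea concrete, here is a minimal Python sketch in which each match is one candidate syllable. The pattern and the name `split_syllables` are assumptions for illustration: it takes leading consonants, a run of vowels, and then either all trailing consonants at the end of the word or a single consonant when another consonant follows. It is a spelling heuristic for English, not a linguistically sound syllabifier.

```python
import re

VOWELS = "aeiouy"

# leading consonants + vowel run + (word-final consonants | one consonant before another)
SYLLABLE_RE = re.compile(
    rf"[^{VOWELS}]*[{VOWELS}]+(?:[^{VOWELS}]*$|[^{VOWELS}](?=[^{VOWELS}]))?"
)

def split_syllables(word):
    """Return the candidate syllables of `word` as a list of strings."""
    return SYLLABLE_RE.findall(word.lower())

print(split_syllables("invisible"))  # ['in', 'vi', 'sib', 'le']
```

Adjacent vowels always stay in one match, so "pronunciation" comes out with four pieces rather than five; handling such cases is where exception lists or patterns become necessary.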
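Answer 8 (score: 4)
Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German, which works quite well and is available on Maven Central.
Caveat: it is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.
To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.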
private Hyphenator createHyphenator(String texTable) {
    Hyphenator hyphenator = new Hyphenator();
    hyphenator.setErrorHandler(new ErrorHandler() {
        public void debug(String guard, String s) {
            logger.debug("{},{}", guard, s);
        }
        public void info(String s) {
            logger.info(s);
        }
        public void warning(String s) {
            logger.warn("WARNING: " + s);
        }
        public void error(String s) {
            logger.error("ERROR: " + s);
        }
        public void exception(String s, Exception e) {
            logger.error("EXCEPTION: " + s, e);
        }
        public boolean isDebugged(String guard) {
            return false;
        }
    });
    BufferedReader table = null;
    try {
        table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
        hyphenator.loadTable(table);
    } catch (Utf8TexParser.TexParserException e) {
        logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
        throw new RuntimeException("Failed to load hyphenation table", e);
    } finally {
        if (table != null) {
            try {
                table.close();
            } catch (IOException e) {
                logger.error("Closing hyphenation table failed", e);
            }
        }
    }
    return hyphenator;
}
Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the hyphens it provides.
String hyphenedTerm = hyphenator.hyphenate(term);
String hyphens[] = hyphenedTerm.split("\u00AD");
int syllables = hyphens.length;
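You need to split on "\u00AD" (the soft hyphen), since the API does not return a normal "-".
This approach outperforms Joe Basirico's answer, since it supports many different languages and detects German hyphenation more accurately.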
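The soft-hyphen splitting step described above can be sketched in a few lines of Python. The input string here is a hand-written stand-in for what a hyphenator might return, not actual output of the library, and `syllables_from_hyphenated` is a hypothetical helper name.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, invisible in most renderers

def syllables_from_hyphenated(hyphenated):
    """Split a soft-hyphenated term into its pieces, dropping empty ones."""
    return [part for part in hyphenated.split(SOFT_HYPHEN) if part]

# Assumed sample input, written by hand for this example.
sample = "in\u00advis\u00adi\u00adble"
parts = syllables_from_hyphenated(sample)
print(parts, len(parts))
```

Because the soft hyphen is usually invisible when printed, a hyphenated string can look unchanged; checking its length or splitting it as above confirms that break points were inserted.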
Answer 9 (score: 3)
Why calculate it? Every online dictionary has this information. http://dictionary.reference.com/browse/invisible in·vis·i·ble
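Answer 10 (score: 2)
I could not find an adequate way to count syllables, so I designed a method myself.
I use a combination of a dictionary and an algorithmic method to count syllables.
You can view my method and library here: https://stackoverflow.com/a/32784041/2734752
I just tested my algorithm and it had a 99.4% strike rate!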
Lawrence lawrence = new Lawrence();
System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));
Output:
4
3
Answer 11 (score: 1)
You can try Spacy Syllables. This works on Python 3.9:
Setup:
pip install spacy
pip install spacy_syllables
python -m spacy download en_core_web_md
Code:
import spacy
from spacy_syllables import SpacySyllables
nlp = spacy.load('en_core_web_md')
syllables = SpacySyllables(nlp)
nlp.add_pipe('syllables', after='tagger')
def spacy_syllablize(word):
    token = nlp(word)[0]
    return token._.syllables
for test_word in ["trampoline", "margaret", "invisible", "thought", "Pronunciation", "couldn't"]:
    print(f"{test_word} -> {spacy_syllablize(test_word)}")
Output:
trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']
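Answer 12 (score: 0)
I ran into this exact same issue a little while ago.
I ended up using the CMU Pronunciation Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that is about 98% accurate at predicting syllable counts.
I wrapped the whole thing up in an easy-to-use Python module here: https://github.com/repp/big-phoney
Install:
pip install big-phoney
Count syllables: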
from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops') # --> 4
If you are not using Python and want to try the ML-model-based approach, I did a pretty detailed write up on how the syllable counting model works on Kaggle.
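The lookup-then-fallback structure described in this answer can be sketched as follows. This is not the big-phoney implementation: `KNOWN_COUNTS` is a tiny hand-written stand-in for the pronunciation dictionary, and the fallback is a plain vowel-run heuristic swapped in for the machine learning model, purely to show the control flow.

```python
import re

# Stand-in for a pronunciation dictionary (assumed sample entries).
KNOWN_COUNTS = {"triceratops": 4, "invisible": 4}

def estimate_syllables(word):
    """Fallback heuristic: one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def count_syllables(word):
    """Prefer the dictionary; estimate only for out-of-vocabulary words."""
    key = word.lower()
    if key in KNOWN_COUNTS:
        return KNOWN_COUNTS[key]
    return estimate_syllables(key)

print(count_syllables("Triceratops"))  # dictionary hit
print(count_syllables("banana"))       # heuristic fallback
```

Keeping the fallback behind a single function makes it easy to replace the heuristic with a trained model later without touching callers.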
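Answer 13 (score: 0)
After doing a lot of testing and trying out hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages, which interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.
My implementation in Python is part of a class I wrote, and the syllable counting routine is pasted below. It overestimates the number of syllables a bit, as I still have not found a good way to account for silent word endings.
The function returns the ratio of syllables per word, as it is used for a Flesch-Kincaid readability score. The number does not have to be exact, just close enough for an estimate.
On my 7th-generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759-word sample text.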
def _countSyllablesEN(self, theText):
    cleanText = ""
    # Lowercase first; otherwise capital letters are replaced by spaces below.
    for ch in theText.lower():
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "
    asVow = "aeiouy'’"
    dExep = ("ei","ie","ua","ia","eo")
    theWords = cleanText.split()
    if not theWords:
        # Avoid dividing by zero on text without any words.
        return 0
    allSylls = 0
    for inWord in theWords:
        nChar = len(inWord)
        nSyll = 0
        wasVow = False
        wasY = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY = inWord[0] == "y"
        for c in range(1,nChar):
            isVow = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY = inWord[c] == "y"
        if inWord.endswith(("e")):
            nSyll -= 1
        if inWord.endswith(("le","ea","io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord,nSyll))
        allSylls += nSyll
    return allSylls/len(theWords)
Answer 14 (score: 0)
I am including a solution that works "okay" in R. It is far from perfect.
countSyllablesInWord = function(words)
{
  #word = "super";
  n.words = length(words);
  result = list();
  for(j in 1:n.words)
  {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    word.vec = strsplit(word,"")[[1]];
    word.vec;
    n.char = length(word.vec);
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    # nontrivial problem
    if(n.vowels <= 1)
    {
      syllables = 1;
      str = word;
    } else {
      # syllables = 0;
      previous = "C";
      # on average ?
      str = "";
      n.hyphen = 0;
      for(i in 1:n.char)
      {
        my.char = word.vec[i];
        my.vowel = is.vowel[i];
        if(my.vowel)
        {
          if(previous == "C")
          {
            if(i == 1)
            {
              str = paste0(my.char, "-");
              n.hyphen = 1 + n.hyphen;
            } else {
              if(i < n.char)
              {
                if(n.vowels > (n.hyphen + 1))
                {
                  str = paste0(str, my.char, "-");
                  n.hyphen = 1 + n.hyphen;
                } else {
                  str = paste0(str, my.char);
                }
              } else {
                str = paste0(str, my.char);
              }
            }
            # syllables = 1 + syllables;
            previous = "V";
          } else { # "VV"
            # assume what ? vowel team?
            str = paste0(str, my.char);
          }
        } else {
          str = paste0(str, my.char);
          previous = "C";
        }
        #
      }
      syllables = 1 + n.hyphen;
    }
    result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
  }
  if(n.words == 1) { result[[1]]; } else { result; }
}
Here are some results:
my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));
my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);
my.count.df;
# syllables vowels word
# 1 4 4 A-me-ri-ca
# 2 4 5 be-auti-fu-l
# 3 3 4 spa-ci-ous
# 4 2 2 ski-es
# 5 2 2 a-mber
# 6 2 2 wa-ves
# 7 2 2 gra-in
# 8 2 2 pu-rple
# 9 3 4 mo-unta-ins
# 10 3 3 ma-je-sty
I did not realize how big a "rabbit hole" this is; it seems so easy.
################ hackathon #######
# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://stackoverflow.com/questions/405161/detecting-syllables-in-a-word
# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/
# https://enchantedlearning.com/consonantblends/index.shtml
# start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr",
# "fl", "fr", "gl", "gr", "pl", "pr",
# "sc", "sh", "sk", "sl", "sm", "sn",
# "sp", "st", "sw", "th", "tr", "tw",
# "wh", "wr");
# start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
# "spr", "squ", "str", "thr");
#
#
#
# end.digraphs = c("ch","sh","th","ng","dge","tch");
#
# ile
#
# farmer
# ar er
#
# vowel teams ... beaver1
#
#
# # "able"
# # http://www.abcfastphonics.com/letter-blends/blend-cial.html
# blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian",
# "ck", "ct", "dge", "dis", "ed", "ex", "ful",
# "gh", "ng", "ous", "kn", "ment", "mis", );
#
# glue = c("ld", "st", "nd", "ld", "ng", "nk",
# "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch",
# "nse", "nt", "ph", "psy", "pt", "re", )
#
#
# start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
#
# # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
# digraphs.start = c("ch","sh","th","wh","ph","qu");
# digraphs.end = c("ch","sh","th","ng","dge","tch");
# # https://www.education.com/worksheet/article/beginning-consonant-blends/
# blends.start = c("pl", "gr", "gl", "pr",
#
# blends.end = c("lk","nk","nt",
#
#
# # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
# # Monte Mon-te
# # Sophia So-phi-a
# # American A-mer-i-can
#
# n.vowels = 0;
# for(i in 1:n.char)
# {
# my.char = word.vec[i];
#
#
#
#
#
# n.syll = 0;
# str = "";
#
# previous = "C"; # consonant vs "V" vowel
#
# for(i in 1:n.char)
# {
# my.char = word.vec[i];
#
# my.vowel = is.element(tolower(my.char), vowels);
# if(my.vowel)
# {
# n.vowels = 1 + n.vowels;
# if(previous == "C")
# {
# if(i == 1)
# {
# str = paste0(my.char, "-");
# } else {
# if(n.syll > 1)
# {
# str = paste0(str, "-", my.char);
# } else {
# str = paste0(str, my.char);
# }
# }
# n.syll = 1 + n.syll;
# previous = "V";
# }
#
# } else {
# str = paste0(str, my.char);
# previous = "C";
# }
# #
# }
#
#
#
#
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE 1
# IDEA 3
# IDEAS 2
# IDEE 2
# IDE 1
# AIDA 2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE 1
# HALIDE 2
# TELEPHONE 3
# TELEPHONY 4
# DUE 1
# IDEAL 2
# DEE 1
# UREA 3
# VACUO 3
# SEANCE 1
# SAILED 1
# RIBBED 1
# MOPED 1
# BLESSED 1
# AGED 1
# TOTED 2
# WARRED 1
# UNDERFED 2
# JADED 2
# INBRED 2
# BRED 1
# RED 1
# STATES 1
# TASTES 1
# TESTES 1
# UTILIZES 4
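In essence, here is a simple Kincaid readability function. The syllables argument is the list of counts returned from the first function.
Since my function is a bit biased towards more syllables, the computed grade level will be somewhat inflated, which for now is fine. If the goal is to make text more readable, this is not the worst thing.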
computeReadability = function(n.sentences, n.words, syllables=NULL)
{
  n = length(syllables);
  n.syllables = 0;
  # seq_len(n) is empty when n is 0, unlike 1:n
  for(i in seq_len(n))
  {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
  }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59;
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948 * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  list("FRE" = FRE, "FKGL" = FKGL);
}
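To see how a syllable overcount feeds into the two scores, here is a small Python sketch of the same formulas with a worked example. The function name `readability` and the sample counts are assumptions for illustration; the coefficients are the ones used in the R function above.

```python
def readability(n_sentences, n_words, n_syllables):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl

# Assumed sample: 2 sentences, 20 words, 30 syllables (1.5 syllables per word).
exact = readability(2, 20, 30)
# The same text with the syllable count overestimated by two.
inflated = readability(2, 20, 32)

print("exact:    FRE=%.3f FKGL=%.2f" % exact)
print("inflated: FRE=%.3f FKGL=%.2f" % inflated)
```

For the sample counts this gives roughly FRE 69.8 and grade level 6.0; overcounting syllables lowers the reading-ease score and raises the grade level, so the text looks harder than it is.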
Answer 15 (score: -1)
I used jsoup to do this once. Here is a sample syllable parser:
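import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public String[] syllables(String text){
    String url = "https://www.merriam-webster.com/dictionary/" + text;
    String relHref;
    try{
        Document doc = Jsoup.connect(url).get();
        Element link = doc.getElementsByClass("word-syllables").first();
        // No syllable markup on the page: return the word unsplit.
        if(link == null){return new String[]{text};}
        relHref = link.html();
    }catch(IOException e){
        // Network or HTTP failure: fall back to the unsplit word.
        relHref = text;
    }
    String[] syl = relHref.split("·");
    return syl;
}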