I wrote this PHP code to implement the Flesch-Kincaid readability score as a function:
function readability($text) {
    $total_sentences = 1; // one full stop = two sentences => start with 1
    $punctuation_marks = array('.', '?', '!', ':');
    foreach ($punctuation_marks as $punctuation_mark) {
        $total_sentences += substr_count($text, $punctuation_mark);
    }
    $total_words = str_word_count($text);
    $total_syllables = 3; // assuming this value since I don't know how to count them
    $score = 206.835 - (1.015 * $total_words / $total_sentences) - (84.6 * $total_syllables / $total_words);
    return $score;
}
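Do you have any suggestions on how to improve the code? Is it correct? Will it work?
I hope you can help me. Thanks in advance!
Answer 0 (score: 17)
As far as heuristics go, the code looks fine. Here are some points to consider that make the quantities you need quite difficult for a machine to compute: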
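What is a sentence?
Seriously, what is a sentence? We have periods, but they are also used in Ph.D., e.g., i.e., Y.M.C.A., and for other purposes that do not end a sentence. Once you also consider exclamation points, question marks, and ellipses, you are doing yourself a disservice by assuming that a period will do the trick. I have looked at this problem before, and if you want a more reliable sentence count on real text, you need to parse the text. That can be computationally intensive and time-consuming, and free resources for it are hard to find. Even then, you still have to worry about the error rate of the particular parser implementation. However, only a full parse will tell you what is a sentence and what is just one of the period's many other uses. Furthermore, if you are working with text "in the wild" (HTML, for example), you also have to worry about sentences that end not with punctuation but with a closing tag. For instance, many sites do not punctuate their h1 and h2 tags, yet those are clearly separate sentences or phrases.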
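To make the difficulty concrete, here is a minimal sketch of a sentence counter that goes one step beyond counting periods. It is not a parser and not a definitive implementation: the function name `countSentencesHeuristic`, the abbreviation list, and the list of block-level tags are all assumptions made for illustration, and the abbreviation list is nowhere near complete.

```php
<?php
// Hypothetical helper (not from the original code): a slightly more careful
// sentence count. Still a heuristic; only a real parser can do this reliably.
function countSentencesHeuristic(string $text): int
{
    // Treat the end of common block-level tags as a sentence break,
    // because headings and list items often carry no punctuation.
    $text = preg_replace('~</(?:h[1-6]|p|li|div)>~i', '. ', $text);
    $text = strip_tags($text);

    // Remove the periods from a few known abbreviations (assumed, incomplete list).
    // Limitation: an abbreviation that ends a sentence loses its terminator.
    $abbreviations = ['Ph.D.', 'e.g.', 'i.e.', 'Y.M.C.A.', 'Dr.', 'Mr.', 'Mrs.'];
    foreach ($abbreviations as $abbr) {
        $text = str_ireplace($abbr, str_replace('.', '', $abbr), $text);
    }

    // A run of terminators ("...", "?!") followed by whitespace or the end
    // of the text counts as a single boundary.
    $parts = preg_split('/[.!?]+(?:\s+|$)/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Keep only fragments that contain at least one letter.
    $sentences = array_filter($parts, fn($part) => preg_match('/\p{L}/u', $part) === 1);

    return count($sentences);
}

echo countSentencesHeuristic("Dr. Smith has a Ph.D. in physics. Wait... Really?!"), "\n";
echo countSentencesHeuristic("<h1>Title</h1><p>First sentence. Second one!</p>"), "\n";
```

Even this version miscounts an ellipsis in the middle of a sentence and any abbreviation missing from the list, which is exactly the kind of error rate described above.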
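Syllables aren't something we should be approximating
Syllable counting is a major hallmark of this readability heuristic, and it is the part that is hardest to implement. Computational analysis of the syllable count in a text requires assuming that the intended reader speaks the same dialect as the one your syllable counter was trained on. How sounds fall around a syllable is a major part of what makes an accent an accent. If you don't believe me, try visiting Jamaica sometime. What this means is that even if a human did the calculation by hand, the result would still be a dialect-specific score.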
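What is a word?
Not to wax psycholinguistic in the slightest, but you will find that space-separated words and what a speaker conceptualizes as words are quite different things. That makes the very concept of a computable readability score somewhat questionable.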
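A small illustration of how much the word count depends on the tokenizer. The sample sentence and variable names are made up for this sketch; the point is only that three reasonable definitions of "word" give three different counts for the same text.

```php
<?php
// Illustrative sample text (note the double space before "perfect").
$text = "The state-of-the-art parser isn't  perfect.";

// 1. Split on single spaces: the double space produces an empty token.
$bySpace = explode(' ', $text);

// 2. Split on any run of whitespace.
$byWhitespace = preg_split('/\s+/', trim($text));

// 3. Runs of ASCII letters only: hyphens and apostrophes break words apart.
preg_match_all('/[A-Za-z]+/', $text, $matches);
$byLetters = $matches[0];

printf(
    "explode: %d, whitespace: %d, letters only: %d\n",
    count($bySpace),
    count($byWhitespace),
    count($byLetters)
);
```

None of the three is "the" right answer; whether "state-of-the-art" is one word or four is a linguistic decision, not a programming one.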
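So, in the end, I can answer your question "will it work?". If you want to take a piece of text and display this readability score alongside other metrics to offer some conceivable added value, then it will do, and no reasonable user will raise all of these questions. If you are trying to do something scientific, or even something pedagogical (which is what this score and others like it were ultimately intended for), I wouldn't bother. In fact, if you plan to use this to make any kind of suggestion to users about content they have produced, I would be extremely hesitant.
A better measure of a text's reading difficulty would more likely involve the ratio of low-frequency words to high-frequency words, together with the number of hapax legomena (words that occur only once) in the text. But I wouldn't pursue actually devising such a heuristic, because testing it empirically would be very difficult.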
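For illustration only, the two quantities mentioned above can be counted as follows. The function name `lexicalStats`, the tokenizer, and the idea of passing in a caller-supplied list of high-frequency words are assumptions of this sketch; a real frequency list would have to come from a corpus, and nothing here has been validated as a readability measure.

```php
<?php
// Hypothetical helper: counts hapax legomena and splits tokens into
// "common" (present in the supplied high-frequency list) and "rare".
function lexicalStats(string $text, array $commonWords): array
{
    // Crude tokenizer (assumption): lowercase ASCII words, optional apostrophe part.
    preg_match_all("/[a-z]+(?:'[a-z]+)?/", strtolower($text), $matches);
    $tokens = $matches[0];

    // Frequency of each distinct word; hapax legomena occur exactly once.
    $frequencies = array_count_values($tokens);
    $hapax = count(array_filter($frequencies, fn($n) => $n === 1));

    // Flip the list so that membership checks are constant-time lookups.
    $common = array_flip($commonWords);
    $rare = count(array_filter($tokens, fn($t) => !isset($common[$t])));

    return [
        'tokens' => count($tokens),
        'hapax'  => $hapax,
        'rare'   => $rare,
        'common' => count($tokens) - $rare,
    ];
}

$stats = lexicalStats(
    "The cat saw the dog. The dog saw a heron.",
    ['the', 'a', 'saw', 'dog', 'cat'] // illustrative high-frequency list
);
print_r($stats);
```

The counts are left unnormalized on purpose: how to combine them into a single score is exactly the part that would need empirical testing.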
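Answer 1 (score: 8)
Take a look at the PHP Text Statistics class on GitHub.
Answer 2 (score: 6)
Please take a look at the following two classes and their usage information. They should help you.
Readability syllable-count pattern library class: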
<?php class ReadabilitySyllableCheckPattern {
public $probWords = [
'abalone' => 4,
'abare' => 3,
'abed' => 2,
'abruzzese' => 4,
'abbruzzese' => 4,
'aborigine' => 5,
'acreage' => 3,
'adame' => 3,
'adieu' => 2,
'adobe' => 3,
'anemone' => 4,
'apache' => 3,
'aphrodite' => 4,
'apostrophe' => 4,
'ariadne' => 4,
'cafe' => 2,
'calliope' => 4,
'catastrophe' => 4,
'chile' => 2,
'chloe' => 2,
'circe' => 2,
'coyote' => 3,
'epitome' => 4,
'forever' => 3,
'gethsemane' => 4,
'guacamole' => 4,
'hyperbole' => 4,
'jesse' => 2,
'jukebox' => 2,
'karate' => 3,
'machete' => 3,
'maybe' => 2,
'people' => 2,
'recipe' => 3,
'sesame' => 3,
'shoreline' => 2,
'simile' => 3,
'syncope' => 3,
'tamale' => 3,
'yosemite' => 4,
'daphne' => 2,
'eurydice' => 4,
'euterpe' => 3,
'hermione' => 4,
'penelope' => 4,
'persephone' => 4,
'phoebe' => 2,
'zoe' => 2
];
public $addSyllablePatterns = [
"([^s]|^)ia",
"iu",
"io",
"eo($|[b-df-hj-np-tv-z])",
"ii",
"[ou]a$",
"[aeiouym]bl$",
"[aeiou]{3}",
"[aeiou]y[aeiou]",
"^mc",
"ism$",
"asm$",
"thm$",
"([^aeiouy])\\1l$", // backslash escaped: "\1" in a double-quoted string is an octal escape, not a backreference
"[^l]lien",
"^coa[dglx].",
"[^gq]ua[^auieo]",
"dnt$",
"uity$",
"[^aeiouy]ie(r|st|t)$",
"eings?$",
"[aeiouy]sh?e[rsd]$",
"iell",
"dea$",
"real",
"[^aeiou]y[ae]",
"gean$",
"riet",
"dien",
"uen"
];
public $prefixSuffixPatterns = [
"^un",
"^fore",
"^ware",
"^none?",
"^out",
"^post",
"^sub",
"^pre",
"^pro",
"^dis",
"^side",
"ly$",
"less$",
"some$",
"ful$",
"ers?$",
"ness$",
"cians?$",
"ments?$",
"ettes?$",
"villes?$",
"ships?$",
"sides?$",
"ports?$",
"shires?$",
"tion(ed)?$"
];
public $subSyllablePatterns = [
"cia(l|$)",
"tia",
"cius",
"cious",
"[^aeiou]giu",
"[aeiouy][^aeiouy]ion",
"iou",
"sia$",
"eous$",
"[oa]gue$",
".[^aeiuoycgltdb]{2,}ed$",
".ely$",
"^jua",
"uai",
"eau",
"[aeiouy](b|c|ch|d|dg|f|g|gh|gn|k|l|ll|lv|m|mm|n|nc|ng|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y|z)e$",
"[aeiouy](b|c|ch|dg|f|g|gh|gn|k|l|lch|ll|lv|m|mm|n|nc|ng|nch|nn|p|r|rc|rn|rs|rv|s|sc|sk|sl|squ|ss|th|v|y|z)ed$",
"[aeiouy](b|ch|d|f|gh|gn|k|l|lch|ll|lv|m|mm|n|nch|nn|p|r|rn|rs|rv|s|sc|sk|sl|squ|ss|st|t|th|v|y)es$",
"^busi$"
]; } ?>
The other class is the readability algorithm class, which has two methods for computing the score:
<?php class ReadabilityAlgorithm {
    function countSyllable($strWord) {
        $pattern = new ReadabilitySyllableCheckPattern();
        // Normalize: the lookup table and all patterns are lowercase letters only
        $strWord = preg_replace('`[^a-z]`', '', strtolower(trim($strWord)));
        if ($strWord === '') {
            return 0; // nothing left to count (empty token, digits, punctuation)
        }
        // Check for problem words
        if (isset($pattern->probWords[$strWord])) {
            return $pattern->probWords[$strWord];
        }
        // Strip prefixes and suffixes. The entries are regular expressions, so they
        // need delimiters and preg_replace(); str_replace() would treat them literally.
        $arrPrefixSuffix = array_map(function ($p) {
            return '`' . $p . '`';
        }, $pattern->prefixSuffixPatterns);
        $strWord = preg_replace($arrPrefixSuffix, '', $strWord, -1, $tmpPrefixSuffixCount);
        // Split on non-vowels; every non-empty vowel group counts as one syllable
        $arrWordParts = preg_split('`[^aeiouy]+`', $strWord);
        $wordPartCount = 0;
        foreach ($arrWordParts as $strWordPart) {
            if ($strWordPart <> '') {
                $wordPartCount++;
            }
        }
        $intSyllableCount = $wordPartCount + $tmpPrefixSuffixCount;
        // Check syllable patterns
        foreach ($pattern->subSyllablePatterns as $strSyllable) {
            $intSyllableCount -= preg_match('`' . $strSyllable . '`', $strWord);
        }
        foreach ($pattern->addSyllablePatterns as $strSyllable) {
            $intSyllableCount += preg_match('`' . $strSyllable . '`', $strWord);
        }
        // Every word has at least one syllable
        return max(1, $intSyllableCount);
    }
    function calculateReadabilityScore($stringText) {
        // Count sentences: start at 1, then add one per terminating mark
        $totalSentences = 1;
        $punctuationMarks = array('.', '!', '?', ':', ';');
        foreach ($punctuationMarks as $punctuationMark) {
            $totalSentences += substr_count($stringText, $punctuationMark);
        }
        // Use one word list for both the word count and the syllable count,
        // so that the two averages are based on the same tokens
        $arrWords = str_word_count($stringText, 1);
        $totalWords = count($arrWords);
        if ($totalWords == 0) {
            return null; // the score is undefined for text without words
        }
        // ASL: average sentence length (words per sentence)
        $ASL = $totalWords / $totalSentences;
        // Sum the syllables of all words
        $syllableCount = 0;
        foreach ($arrWords as $strWord) {
            $syllableCount += $this->countSyllable($strWord);
        }
        // ASW: average number of syllables per word
        $ASW = $syllableCount / $totalWords;
        // Compute the readability score
        $score = 206.835 - (1.015 * $ASL) - (84.6 * $ASW);
        return $score;
    } } ?>
// Example: how to use it
<?php // Create object to count readability score
$readObj = new ReadabilityAlgorithm();
echo $readObj->calculateReadabilityScore("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into: electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently; with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum!");
?>
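For reference, the expression that `calculateReadabilityScore` evaluates, written out with ASL (average sentence length, words per sentence) and ASW (average syllables per word). These constants are those of the Flesch Reading Ease score, where higher means easier to read. The Flesch-Kincaid Grade Level named in the question is a related but different formula, shown second for comparison.

```latex
\text{Reading Ease} = 206.835 - 1.015 \cdot \mathrm{ASL} - 84.6 \cdot \mathrm{ASW},
\qquad
\mathrm{ASL} = \frac{\text{total words}}{\text{total sentences}},
\quad
\mathrm{ASW} = \frac{\text{total syllables}}{\text{total words}}

\text{Grade Level} = 0.39 \cdot \mathrm{ASL} + 11.8 \cdot \mathrm{ASW} - 15.59
```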
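Answer 3 (score: 0)
I don't actually see any problem with that code. Of course, it could be optimized a little if you really wanted to, by replacing all the separate function calls with a single counting loop. However, I would strongly argue that this is unnecessary and might even be plain wrong. Your current code is very readable and easy to understand, and any optimization would probably make it worse in that respect. Use it as it is, and don't try to optimize it unless it actually turns out to be a performance bottleneck.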