用于PHP的Schinke拉丁语词干算法

时间:2010-10-13 23:38:20

标签: php algorithm function stemming

This website提供“Schinke拉丁词干算法”供下载,以便在Snowball词干系统中使用。

我想使用这个算法,但我不想使用Snowball。

好处:该页面上有some pseudocode,您可以将其转换为PHP函数。这就是我尝试过的:

<?php
function stemLatin($word) {
    // output = array(NOUN-BASED STEM, VERB-BASED STEM)
    // DEFINE CLASSES BEGIN
    $queWords = array('atque', 'quoque', 'neque', 'itaque', 'absque', 'apsque', 'abusque', 'adaeque', 'adusque', 'denique', 'deque', 'susque', 'oblique', 'peraeque', 'plenisque', 'quandoque', 'quisque', 'quaeque', 'cuiusque', 'cuique', 'quemque', 'quamque', 'quaque', 'quique', 'quorumque', 'quarumque', 'quibusque', 'quosque', 'quasque', 'quotusquisque', 'quousque', 'ubique', 'undique', 'usque', 'uterque', 'utique', 'utroque', 'utribique', 'torque', 'coque', 'concoque', 'contorque', 'detorque', 'decoque', 'excoque', 'extorque', 'obtorque', 'optorque', 'retorque', 'recoque', 'attorque', 'incoque', 'intorque', 'praetorque');
    $suffixesA = array('ibus, 'ius, 'ae, 'am, 'as, 'em', 'es', ia', 'is', 'nt', 'os', 'ud', 'um', 'us', 'a', 'e', 'i', 'o', 'u');
    $suffixesB = array('iuntur', 'beris', 'erunt', 'untur', 'iunt', 'mini', 'ntur', 'stis', 'bor', 'ero', 'mur', 'mus', 'ris', 'sti', 'tis', 'tur', 'unt', 'bo', 'ns', 'nt', 'ri', 'm', 'r', 's', 't');
    // DEFINE CLASSES END
    $word = strtolower(trim($word)); // make string lowercase + remove white spaces before and behind
    $word = str_replace('j', 'i', $word); // replace all <j> by <i>
    $word = str_replace('v', 'u', $word); // replace all <v> by <u>
    if (substr($word, -3) == 'que') { // if word ends with -que
        if (in_array($word, $queWords)) { // if word is a queWord
            return array($word, $word); // output queWord as both noun-based and verb-based stem
        }
        else {
            $word = substr($word, 0, -3); // remove the -que
        }
    }
    foreach ($suffixesA as $suffixA) { // remove suffixes for noun-based forms (list A)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            $word = substr($word, 0, -strlen($suffixA)); // remove the suffix
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $nounBased = $word; } else { $nounBased = ''; } // add only if word contains two or more characters
    foreach ($suffixesB as $suffixB) { // remove suffixes for verb-based forms (list B)
        if (substr($word, -strlen($suffixA)) == $suffixA) { // if the word ends with that suffix
            switch ($suffixB) {
                case 'iuntur', 'erunt', 'untur', 'iunt', 'unt': $word = substr($word, 0, -strlen($suffixB)).'i'; break; // replace suffix by <i>
                case 'beris', 'bor', 'bo': $word = substr($word, 0, -strlen($suffixB)).'bi'; break; // replace suffix by <bi>
                case 'ero': $word = substr($word, 0, -strlen($suffixB)).'eri'; break; // replace suffix by <eri>
                default: $word = substr($word, 0, -strlen($suffixB)); break; // remove the suffix
            }
            break; // remove only one suffix
        }
    }
    if (strlen($word) >= 2) { $verbBased = $word; } else { $verbBased = ''; } // add only if word contains two or more characters
    return array($nounBased, $verbBased);
}
?>

我的问题:

1)此代码是否正常工作?它是否遵循算法的规则?

2)你怎么能改进代码(性能)?

非常感谢你!

2 个答案:

答案 0 :(得分:2)

据我所知,这遵循链接中描述的算法,并且应该正常工作。 (除了$suffixesA定义中的语法错误之外 - 你错过了几个撇号。)

在性能方面,看起来并没有太大的收获,但有一些事情会浮现在脑海中。

如果在单次执行脚本期间多次调用它,可能会通过在函数外部定义这些数组来获得一些东西 - 我不认为PHP足够聪明,可以在调用之间缓存这些数组功能。

您还可以将这两个str_replace组合成一个:$word = str_replace(array('j','v'), array('i','u'), $word);,或者,因为您要用单个字符替换单个字符,您可以使用$word = strtr($word,'jv','iu'); - 但我不知道我认为这在实践中会有很大的不同。你必须尝试一下才能确定。

答案 1 :(得分:2)

不,你的功能不起作用,它包含语法错误。例如,您有未公开的引号,并且使用了错误的switch语法。

这是我重写的功能。由于该页面上的伪算法不是很精确,我不得不做一些解释。我解释它的方式是本文中提到的示例有效。

我也做了一些优化。第一个是我定义单词和后缀数组static。因此,对此函数的所有调用都共享相同的数组,这应该是良好的性能;)

此外,我调整了阵列,以便更有效地使用它们。我更改了$queWords数组,因此它可用于快速哈希表查找,而不是慢in_array。此外,我已经保存了数组中后缀的长度。因此,您不需要在运行时计算它们(这实际上非常慢)。我可能做了一些小的优化。

我不知道这段代码的速度有多快,但它应该快得多。此外,它现在适用于提供的示例。

以下是代码:

<?php
    function stemLatin($word) {
        static $queWords = array(
            'atque'         => 1,
            'quoque'        => 1,
            'neque'         => 1,
            'itaque'        => 1,
            'absque'        => 1,
            'apsque'        => 1,
            'abusque'       => 1,
            'adaeque'       => 1,
            'adusque'       => 1,
            'denique'       => 1,
            'deque'         => 1,
            'susque'        => 1,
            'oblique'       => 1,
            'peraeque'      => 1,
            'plenisque'     => 1,
            'quandoque'     => 1,
            'quisque'       => 1,
            'quaeque'       => 1,
            'cuiusque'      => 1,
            'cuique'        => 1,
            'quemque'       => 1,
            'quamque'       => 1,
            'quaque'        => 1,
            'quique'        => 1,
            'quorumque'     => 1,
            'quarumque'     => 1,
            'quibusque'     => 1,
            'quosque'       => 1,
            'quasque'       => 1,
            'quotusquisque' => 1,
            'quousque'      => 1,
            'ubique'        => 1,
            'undique'       => 1,
            'usque'         => 1,
            'uterque'       => 1,
            'utique'        => 1,
            'utroque'       => 1,
            'utribique'     => 1,
            'torque'        => 1,
            'coque'         => 1,
            'concoque'      => 1,
            'contorque'     => 1,
            'detorque'      => 1,
            'decoque'       => 1,
            'excoque'       => 1,
            'extorque'      => 1,
            'obtorque'      => 1,
            'optorque'      => 1,
            'retorque'      => 1,
            'recoque'       => 1,
            'attorque'      => 1,
            'incoque'       => 1,
            'intorque'      => 1,
            'praetorque'    => 1,
        );
        static $suffixesNoun = array(
            'ibus' => 4,
            'ius'  => 3,
            'ae'   => 2,
            'am'   => 2,
            'as'   => 2,
            'em'   => 2,
            'es'   => 2,
            'ia'   => 2,
            'is'   => 2,
            'nt'   => 2,
            'os'   => 2,
            'ud'   => 2,
            'um'   => 2,
            'us'   => 2,
            'a'    => 1,
            'e'    => 1,
            'i'    => 1,
            'o'    => 1,
            'u'    => 1,
        );
        static $suffixesVerb = array(
            'iuntur' => 6,
            'beris'  => 5,
            'erunt'  => 5,
            'untur'  => 5,
            'iunt'   => 4,
            'mini'   => 4,
            'ntur'   => 4,
            'stis'   => 4,
            'bor'    => 3,
            'ero'    => 3,
            'mur'    => 3,
            'mus'    => 3,
            'ris'    => 3,
            'sti'    => 3,
            'tis'    => 3,
            'tur'    => 3,
            'unt'    => 3,
            'bo'     => 2,
            'ns'     => 2,
            'nt'     => 2,
            'ri'     => 2,
            'm'      => 1,
            'r'      => 1,
            's'      => 1,
            't'      => 1,
        );

        $stems = array($word, $word);

        $word = strtr(strtolower(trim($word)), 'jv', 'iu'); // trim, lowercase and j => i, v => u

        if (substr($word, -3) == 'que') {
            if (isset($queWords[$word])) {
                return array($word, $word);
            }
            $word = substr($word, 0, -3);
        }

        foreach ($suffixesNoun as $suffix => $length) {
            if (substr($word, -$length) == $suffix) {
                $tmp = substr($word, 0, -$length);

                if (isset($tmp[1]))
                    $stems[0] = $tmp;
                break;
            }
        }

        foreach ($suffixesVerb as $suffix => $length) {
            if (substr($word, -$length) == $suffix) {
                switch ($suffix) {
                    case 'iuntur':
                    case 'erunt':
                    case 'untur':
                    case 'iunt':
                    case 'unt':
                        $tmp = substr_replace($word, 'i', -$length, $length);
                    break;
                    case 'beris':
                    case 'bor':
                    case 'bo':
                        $tmp = substr_replace($word, 'bi', -$length, $length);
                    break;
                    case 'ero':
                        $tmp = substr_replace($word, 'eri', -$length, $length);
                    break;
                    default:
                        $tmp = substr($word, 0, -$length);
                }

                if (isset($tmp[1]))
                    $stems[1] = $tmp;
                break;
            }
        }

        return $stems;
    }

    var_dump(stemLatin('aquila'));
    var_dump(stemLatin('portat'));
    var_dump(stemLatin('portis'));