For lemmatization, spaCy has lists of words: adjectives, adverbs, verbs... and also lists of exceptions: adverbs_irreg... For the regular words there is a set of rules.
Let's take the word "wider" as an example.
As it is an adjective, the rule for lemmatization should be taken from this list:
ADJECTIVE_RULES = [
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
]
As I understand it, the process works like this:
1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.
Now, how is it decided to use "er" -> "e" instead of "er" -> "" so that we get "wide" and not "wid"?
It can be tested here.
Answer 0 (score: 10)
Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
It starts off by initializing 3 variables:
class Lemmatizer(object):
@classmethod
def load(cls, path, index=None, exc=None, rules=None):
return cls(index or {}, exc or {}, rules or {})
def __init__(self, index, exceptions, rules):
self.index = index
self.exc = exceptions
self.rules = rules
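For illustration, here is a toy instantiation of the class above with hand-made data. This is only a sketch: the real index, exceptions and rules come from the WordNet-derived files discussed below, and are keyed by part of speech rather than being flat like this.

# Toy stand-ins for the WordNet-derived data (illustrative only).
index = {'wide', 'bad'}                                        # known lemmas
exc = {'worse': ('bad',)}                                      # irregular forms
rules = [['er', ''], ['est', ''], ['er', 'e'], ['est', 'e']]   # suffix rules

lemmatizer = Lemmatizer(index, exc, rules)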
Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py, which loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer
Most probably the reason is that declaring the strings in-code is faster than streaming them in through I/O.
Looking at them closely, they all seem to come from the original Princeton WordNet: https://wordnet.princeton.edu/man/wndb.5WN.html
Rules
Looking even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749
These rules originally come from the Morphy software: https://wordnet.princeton.edu/man/morphy.7WN.html
Additionally, spacy includes some punctuation rules that are not from Princeton Morphy:
PUNCT_RULES = [
["“", "\""],
["”", "\""],
["\u2018", "'"],
["\u2019", "'"]
]
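These punctuation pairs go through the same suffix-replacement machinery as the adjective rules. A minimal hedged sketch of applying such a rule list (the helper name is mine, not spaCy's), with the PUNCT_RULES list above in scope:

def apply_suffix_rule(token, rules):
    # Replace the first matching "old" suffix with its "new" counterpart.
    for old, new in rules:
        if token.endswith(old):
            return token[:len(token) - len(old)] + new
    return token

assert apply_suffix_rule('\u2019', PUNCT_RULES) == "'"  # curly -> straight quote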
Exceptions
As for the exceptions, they are stored in spacy's *_irreg.py files, and they look like they also come from the Princeton WordNet. This is evident if we look at some mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and download the wordnet package from nltk: we see that it is the same list:

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc
1490 adj.exc

Index
If we look at the spacy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py, as well as the re-distributed copy of wordnet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj
1 This software and database is being provided to you, the LICENSEE, by
2 Princeton University under the following license. By obtaining, using
3 and/or copying this software and database, you agree that you have
4 read, understood, and will comply with these terms and conditions.:
5
6 Permission to use, copy, modify and distribute this software and
7 database and its documentation for any purpose and without fee or
8 royalty is hereby granted, provided that you agree to comply with
9 the following copyright notice and statements, including the disclaimer,
10 and that the same appear on ALL copies of the software, database and
11 documentation, including modifications that you make for internal
12 use or for distribution.
13
14 WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
15
16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
18 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
23 OTHER RIGHTS.
24
25 The name of Princeton University or Princeton may not be used in
26 advertising or publicity pertaining to distribution of the software
27 and/or database. Title to copyright in this software, database and
28 any associated documentation shall at all times remain with
29 Princeton University and LICENSEE agrees to preserve same.
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 | being born or beginning; "the nascent chicks"; "a nascent insurgency"
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
Given that the dictionary, exceptions and rules the spacy lemmatizer uses largely come from Princeton WordNet and its Morphy software, we can move on to the actual implementation of how spacy applies the rules using the index and the exceptions.
We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py
The main action comes from the lemmatize function rather than the Lemmatizer class:
def lemmatize(string, index, exceptions, rules):
string = string.lower()
forms = []
# TODO: Is this correct? See discussion in Issue #435.
#if string in index:
# forms.append(string)
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
if not form:
pass
elif form in index or not form.isalpha():
forms.append(form)
else:
oov_forms.append(form)
if not forms:
forms.extend(oov_forms)
if not forms:
forms.append(string)
return set(forms)
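This answers the question's "wider" example directly. Here is a toy trace, assuming the lemmatize function above is in scope and using made-up minimal data in place of the WordNet-derived index, exceptions and rules:

# Only 'wide' is a known adjective lemma in this toy index.
index = {'wide', 'bad'}
exceptions = {'worse': ['bad']}
rules = [['er', ''], ['est', ''], ['er', 'e'], ['est', 'e']]

print(lemmatize('wider', index, exceptions, rules))
# {'wide'} -- the ["er", ""] rule also produced 'wid', but 'wid' is
# not in the index, so it only reaches oov_forms and is discarded
# because an in-vocabulary form ('wide') was found.

In other words, nothing "decides" between the two "er" rules up front: both are applied, and the index filters the results.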
Why is the lemmatize method outside of the Lemmatizer class? I am not exactly sure, but perhaps it is to ensure that the lemmatization function can be called without a class instance. Given that @staticmethod and @classmethod exist, though, perhaps there are other considerations as to why the function and the class have been decoupled.
Comparing spacy's lemmatize() function with the morphy() function in nltk (which originally comes from Oliver Steele's Python port of the WordNet morphy, http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in morphy() are:
1. Check the exception lists.
2. Apply the rules once to the input to get y1, y2, y3, etc.
3. Return all forms that are in the database (and check the original too).
4. If there are no matches, keep applying the rules until a match is found.
5. Return an empty list if nothing can be found.
For spacy, it is possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76, but the general process seems to be:
1. Look up the exceptions; take the lemma from the exception list if the word is in it.
2. Apply the rules.
3. Save the resulting forms that are in the index lists.
4. If steps 1-3 produce no lemma, keep track of the out-of-vocabulary (OOV) forms and, failing those, fall back to the original string.
5. Return the lemma forms.
In terms of OOV handling, spacy returns the original string if no lemmatized form is found; in that respect the nltk implementation of morphy does the same, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'
Possibly another point of difference is how morphy and spacy decide which POS to assign to the word. In that respect, spacy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips lemmatization entirely if the word is already in the infinitive form (is_base_form()); this saves quite a bit of time if all words in a corpus are to be lemmatized and a sizable chunk of them are infinitives (already the lemma form). A minimal sketch of this short-circuit idea appears after the list below.
But this is possible in spacy because it allows the lemmatizer to access the POS, which is tied closely to some morphological rules. For morphy, although it is possible to figure out some of the morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.
In general, the 3 primary signals of morphological features that need to be teased out of the POS tag are:
- person
- number
- gender
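As promised above, here is a minimal sketch of the base-form short-circuit idea. It is not spaCy's actual is_base_form(); the morphology keys are assumptions for illustration, and it reuses the lemmatize function shown earlier:

def looks_like_base_form(univ_pos, morphology=None):
    # Sketch: treat infinitive verbs and singular nouns as already
    # being the lemma, so the rule machinery can be skipped.
    morphology = morphology or {}
    if univ_pos == 'verb' and morphology.get('VerbForm') == 'inf':
        return True
    if univ_pos == 'noun' and morphology.get('Number') == 'sing':
        return True
    return False

def lemmatize_with_shortcut(string, univ_pos, morphology,
                            index, exceptions, rules):
    if looks_like_base_form(univ_pos, morphology):
        return {string.lower()}          # skip lookup and rules entirely
    return lemmatize(string.lower(), index, exceptions, rules)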
SpaCy did make changes to their lemmatizer after this answer was initially written (12 May). I think the purpose was to make lemmatization faster by skipping the lookup and rules processing.
So they pre-lemmatize words and keep them in a lookup hash table, making retrieval O(1) for the words they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92
But the underlying lemmatization steps discussed above are still relevant to the current spacy version.
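A minimal sketch of that lookup-table idea, with made-up table contents (the real table lives in lookup.py linked above):

# Pre-lemmatized words in a plain dict make lemmatization an O(1)
# get, with the original string as the OOV fallback.
LOOKUP = {'wider': 'wide', 'worse': 'bad', 'corpora': 'corpus'}

def lookup_lemmatize(word):
    return LOOKUP.get(word, word)

print(lookup_lemmatize('wider'))     # 'wide'
print(lookup_lemmatize('blandier'))  # 'blandier' (not in the table)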
I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based approaches to lemmatization?"
But before answering that question, "what exactly is a lemma?" might be the better question to ask.
Answer 1 (score: 8)
TLDR: spaCy checks whether the lemma it is trying to generate is in the known list of words or in the exceptions for that part of speech.
Long answer:
Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.
def lemmatize(string, index, exceptions, rules):
string = string.lower()
forms = []
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
if string.endswith(old):
form = string[:len(string) - len(old)] + new
if not form:
pass
elif form in index or not form.isalpha():
forms.append(form)
else:
oov_forms.append(form)
if not forms:
forms.extend(oov_forms)
if not forms:
forms.append(string)
return set(forms)
For English adjectives, for example, it pulls in the string we are evaluating, the index of known adjectives, the exceptions, and the rules, which, as you referenced, come from this directory (for the English model).
After lowercasing the string, the first thing we do in lemmatize is check whether the string is in our list of known exceptions, which includes lemma rules for words like "worse" -> "bad".
Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would apply the following rules:
["er", ""],
["est", ""],
["er", "e"],
["est", "e"]
and we would output the following forms: ["wid", "wide"].
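The two candidate forms come straight from the slicing in the lemmatize function above; a quick check:

string = 'wider'
for old, new in [['er', ''], ['er', 'e']]:   # the two rules that match
    print(string[:len(string) - len(old)] + new)
# wid
# wide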
Then we check whether each form is in our index of known adjectives. If it is, we append it to forms. Otherwise we add it to oov_forms, which I am guessing is short for out of vocabulary. wide is in the index, so it gets added; wid gets added to oov_forms.
Lastly, we return the set of lemmas found, or else any lemmas that matched a rule but were not in our index, or else just the word itself.
The word-lemmatize link you posted above works for wider because wide is in the word index. Try something like He is blandier than I. spaCy will mark blandier (a word I made up) as an adjective, but since it is not in the index, it will just return blandier as the lemma.
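If you want to verify this behaviour yourself, here is a hedged sketch using spaCy's public API (the model name is an assumption; any installed English model with a tagger should behave similarly):

import spacy

nlp = spacy.load('en_core_web_sm')   # assumed installed English model
doc = nlp('He is blandier than I.')
for token in doc:
    print(token.text, token.pos_, token.lemma_)
# 'blandier' should be tagged ADJ but come back unchanged as its own
# lemma, while a known word like 'is' lemmatizes to 'be'.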