Parsing words into (prefix, root, suffix) in Python

Time: 2012-04-14 19:03:14

Tags: python parsing nlp

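I'm trying to create a simple parser for some text data. (The text is in a language for which NLTK doesn't have any parsers.)

Basically, I have a limited number of prefixes, each one or two letters long; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever lies between them should be the "root" of the word. Many words will have more than one possible parse, so I want to input a word and get back a list of possible parses, each in the form of a tuple (prefix, root, suffix).

I can't figure out how to structure the code, though. I've pasted an example of one approach I tried (using some dummy English data to make it easier to follow), but it's clearly not right. For one thing, it's really ugly and redundant, so I'm sure there's a better way to do it. For another, it doesn't work for words that have more than one prefix or suffix, or that have both prefix(es) and suffix(es).

Any thoughts?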

prefixes = ['de','con']
suffixes = ['er','s']

def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses



>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...   parses = parser(w)
...   print(w)
...   for p in parses:
...     print(p)
... 
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')

3 answers:

Answer 0 (score: 2)

Here is my solution:

prefixes = ['de','con']
suffixes = ['er','s']

def parse(word):
    prefix = ''
    suffix = ''

    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True

    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True

    return (prefix, word, suffix)

print(parse('construct'))
print(parse('destructer'))
print(parse('deconstructs'))
print(parse('deconstructers'))
print(parse('deconstructser'))
print(parse('condestructser'))

Result:

>>> 
('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')

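The idea is to loop over all the prefixes and aggregate them while stripping them from the word. The tricky part is that the order in which the prefixes are defined can hide some prefixes from being found, so the iteration has to be repeated until a full pass finds no more prefixes.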
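To see why the outer while loop matters, here is a small sketch comparing a single pass over the prefix list with the repeated version. The helper names `strip_prefixes_once` and `strip_prefixes_repeat` are made up for this illustration and are not part of the solution above.

```python
PREFIXES = ['de', 'con']

def strip_prefixes_once(word, prefixes=PREFIXES):
    """One pass over the prefix list (hypothetical helper, for illustration)."""
    prefix = ''
    for p in prefixes:
        if word.startswith(p):
            prefix += p
            word = word[len(p):]
    return prefix, word

def strip_prefixes_repeat(word, prefixes=PREFIXES):
    """Repeat the pass until a full pass strips nothing."""
    prefix = ''
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):]
                found = True
    return prefix, word

# 'de' is checked before 'con', so a single pass never looks for 'de'
# again after 'con' has been removed.
print(strip_prefixes_once('condestruct'))    # ('con', 'destruct')
print(strip_prefixes_repeat('condestruct'))  # ('conde', 'struct')
```

With the prefixes listed as `['de', 'con']`, the single pass handles 'deconstruct' correctly but misses the 'de' in 'condestruct'; repeating the pass removes that dependence on list order.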

The same goes for the suffixes, except that we build the suffix string in reverse order.
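Note that this returns a single greedy parse per word, whereas the question asks for a list of possible parses. Below is a minimal sketch of one way to enumerate them. It assumes that "possible" means any run of zero or more known prefixes at the start, any run of zero or more known suffixes at the end, and a non-empty root in between. The function names and the `min_root` parameter are my own assumptions, not something from the question.

```python
PREFIXES = ['de', 'con']
SUFFIXES = ['er', 's']

def prefix_splits(word, prefixes=PREFIXES):
    """Yield (tuple_of_prefixes, rest) for every run of leading prefixes."""
    yield (), word
    for p in prefixes:
        if word.startswith(p):
            for more, rest in prefix_splits(word[len(p):], prefixes):
                yield (p,) + more, rest

def suffix_splits(word, suffixes=SUFFIXES):
    """Yield (rest, tuple_of_suffixes) for every run of trailing suffixes."""
    yield word, ()
    for s in suffixes:
        if word.endswith(s):
            for rest, more in suffix_splits(word[:-len(s)], suffixes):
                yield rest, more + (s,)

def all_parses(word, min_root=1):
    """Return every (prefixes, root, suffixes) split with len(root) >= min_root."""
    parses = []
    for pre, rest in prefix_splits(word):
        for root, suf in suffix_splits(rest):
            if len(root) >= min_root:
                parses.append((pre, root, suf))
    return parses

for parse in all_parses('deconstructs'):
    print(parse)
```

For 'deconstructs' this yields six candidates, ranging from `((), 'deconstructs', ())` to `(('de', 'con'), 'struct', ('s',))`. Raising `min_root` is one way to discard parses whose root is implausibly short.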

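Answer 1 (score: 2)

CodeChords man beat me to it, but since mine returns the prefixes and suffixes as tuples (which may be more or less useful depending on context) and uses recursion, I figured I'd post it anyway.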

class Parser:
    PREFIXES = ['de', 'con']
    SUFFIXES = ['er', 's']
    MINIMUM_STEM_LENGTH = 3

    @classmethod
    def prefixes(cls, word, internal=False):
        stem = word
        prefix = None
        for potential_prefix in cls.PREFIXES:
            if word.startswith(potential_prefix):
                prefix = potential_prefix
                stem = word[len(prefix):]
                if len(stem) >= cls.MINIMUM_STEM_LENGTH:
                    break
                else:
                    # Stripping would leave too short a stem, so undo it.
                    prefix = None
                    stem = word
        if prefix:
            others, stem = cls.prefixes(stem, True)
            others.append(prefix)
            # Inner calls collect prefixes innermost-first; reverse once at the top.
            return (others, stem) if internal else (list(reversed(others)), stem)
        else:
            return [], stem

    @classmethod
    def suffixes(cls, word):
        suffix = None
        stem = word
        for potential_suffix in cls.SUFFIXES:
            if word.endswith(potential_suffix):
                suffix = potential_suffix
                stem = word[:-len(suffix)]
                if len(stem) >= cls.MINIMUM_STEM_LENGTH:
                    break
                else:
                    # Stripping would leave too short a stem, so undo it.
                    suffix = None
                    stem = word
        if suffix:
            # Suffixes are found outermost-first, so appending after the
            # recursive call leaves them in reading order.
            others, stem = cls.suffixes(stem)
            others.append(suffix)
            return others, stem
        else:
            return [], stem

    @classmethod
    def parse(cls, word):
        prefixes, word = cls.prefixes(word)
        suffixes, word = cls.suffixes(word)
        return tuple(prefixes), word, tuple(suffixes)

words = ['con', 'deAAers', 'deAAAers', 'construct', 'destructer', 'constructs', 'deconstructs', 'deconstructers']

parser = Parser()
for word in words:
    print(parser.parse(word))

This gives us:

((), 'con', ())
(('de',), 'AAer', ('s',))
(('de',), 'AAA', ('er', 's'))
(('con',), 'struct', ())
(('de',), 'struct', ('er',))
(('con',), 'struct', ('s',))
(('de', 'con'), 'struct', ('s',))
(('de', 'con'), 'struct', ('er', 's'))

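This works by taking the word and using str.startswith() to find a prefix. It recurses until the word has been reduced to one with no prefixes, then passes the list of prefixes back up.

It then does a similar thing for the suffixes, except using str.endswith(), for obvious reasons.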
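The different results for 'deAAers' and 'deAAAers' above come from the minimum stem length check: an affix is only stripped if at least three characters remain afterwards. Here is a standalone sketch of just that check for suffixes, written iteratively rather than recursively. The function name `strip_suffixes` and its parameters are hypothetical and not part of the class above.

```python
def strip_suffixes(word, suffixes=('er', 's'), min_stem=3):
    """Strip trailing suffixes while at least min_stem characters remain."""
    found = []
    stripped = True
    while stripped:
        stripped = False
        for s in suffixes:
            stem = word[:-len(s)]
            if word.endswith(s) and len(stem) >= min_stem:
                found.insert(0, s)  # suffixes are discovered right to left
                word = stem
                stripped = True
                break
    return tuple(found), word

print(strip_suffixes('AAers'))   # (('s',), 'AAer')
print(strip_suffixes('AAAers'))  # (('er', 's'), 'AAA')
```

In the first case, removing 'er' from 'AAer' would leave the two-character stem 'AA', so the 'er' stays attached to the root.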

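Answer 2 (score: 2)

Pyparsing wraps the string indexing and token extraction in its own parsing framework, and lets you use simple arithmetic syntax to build up your parsing definitions: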

wordlist = ['construct','destructer','constructs','deconstructs']

from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo

endOfString = StringEnd()
prefix = oneOf("de con")
suffix = oneOf("er s") + FollowedBy(endOfString)

word = (ZeroOrMore(prefix)("prefixes") + 
        SkipTo(suffix | endOfString)("root") + 
        Optional(suffix)("suffix"))

for wd in wordlist:
    print(wd)
    res = word.parseString(wd)
    print(res.dump())
    print(res.prefixes)
    print(res.root)
    print(res.suffix)
    print()

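The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output of this program is: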

construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct


destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']

constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']

deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
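If pyparsing is not available, roughly the same grammar (zero or more prefixes, a root, and at most one suffix anchored at the end of the word) can be approximated with the standard re module. This is a different technique from the pyparsing one above, offered only as a sketch; the pattern and the `parse_word` name are my own assumptions.

```python
import re

PREFIXES = ['de', 'con']
SUFFIXES = ['er', 's']

prefix_alt = '|'.join(map(re.escape, PREFIXES))
suffix_alt = '|'.join(map(re.escape, SUFFIXES))

# Greedy run of prefixes, lazy root, optional single suffix.
# fullmatch() anchors the pattern at both ends of the word.
WORD_RE = re.compile(
    r'(?P<prefixes>(?:%s)*)(?P<root>.*?)(?P<suffix>%s)?' % (prefix_alt, suffix_alt),
    re.DOTALL,
)

def parse_word(word):
    """Return (list_of_prefixes, root, suffix) for a single word."""
    m = WORD_RE.fullmatch(word)
    # Split the matched prefix run back into its individual prefixes.
    prefixes = re.findall(prefix_alt, m.group('prefixes'))
    return prefixes, m.group('root'), m.group('suffix') or ''

for wd in ['construct', 'destructer', 'constructs', 'deconstructs']:
    print(wd, parse_word(wd))
```

The lazy `.*?` plays the role of SkipTo here: the root grows only until a suffix that ends the word (or the end of the word itself) can be matched.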