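I am trying to create a simple parser for some text data. (The text is in a language for which NLTK has no parser.)
Basically, I have a limited number of prefixes, each one or two letters long; a word can have more than one prefix. I also have a limited number of suffixes of one or two letters. Whatever lies between them should be the "root" of the word. Many words will have more than one possible parse, so I want to input a word and get back a list of possible parses in the form of tuples (prefix, root, suffix).
I can't figure out how to structure the code, though. I have pasted an example of one approach I tried (using some dummy English data to make it easier to follow), but it is clearly not right. For one thing, it is really ugly and redundant, so I am sure there is a better way to do it. For another, it does not work for words with more than one prefix or suffix, or with both prefix(es) and suffix(es).
Any thoughts?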
prefixes = ['de','con']
suffixes = ['er','s']
def parser(word):
    poss_parses = []
    if word[0:2] in prefixes:
        poss_parses.append((word[0:2],word[2:],''))
    if word[0:3] in prefixes:
        poss_parses.append((word[0:3],word[3:],''))
    if word[-2:-1] in prefixes:
        poss_parses.append(('',word[:-2],word[-2:-1]))
    if word[-3:-1] in prefixes:
        poss_parses.append(('',word[:-3],word[-3:-1]))
    if word[0:2] in prefixes and word[-2:-1] in suffixes and len(word[2:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:2] in prefixes and word[-3:-1] in suffixes and len(word[2:-3])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-3:-1]))
    if word[0:3] in prefixes and word[-2:-1] in suffixes and len(word[3:-2])>2:
        poss_parses.append((word[0:2],word[2:-2],word[-2:-1]))
    if word[0:3] in prefixes and word[-3:-1] in suffixes and len(word[3:-3])>2:
        poss_parses.append((word[0:3],word[3:-2],word[-3:-1]))
    return poss_parses
>>> wordlist = ['construct','destructer','constructs','deconstructs']
>>> for w in wordlist:
...     parses = parser(w)
...     print(w)
...     for p in parses:
...         print(p)
...
construct
('con', 'struct', '')
destructer
('de', 'structer', '')
constructs
('con', 'structs', '')
deconstructs
('de', 'constructs', '')
Answer 0 (score: 2)
Here is my solution:
prefixes = ['de','con']
suffixes = ['er','s']
def parse(word):
    prefix = ''
    suffix = ''
    # find all prefixes
    found = True
    while found:
        found = False
        for p in prefixes:
            if word.startswith(p):
                prefix += p
                word = word[len(p):] # remove prefix from word
                found = True
    # find all suffixes
    found = True
    while found:
        found = False
        for s in suffixes:
            if word.endswith(s):
                suffix = s + suffix
                word = word[:-len(s)] # remove suffix from word
                found = True
    return (prefix, word, suffix)
print(parse('construct'))
print(parse('destructer'))
print(parse('deconstructs'))
print(parse('deconstructers'))
print(parse('deconstructser'))
print(parse('condestructser'))
Result:
>>>
('con', 'struct', '')
('de', 'struct', 'er')
('decon', 'struct', 's')
('decon', 'struct', 'ers')
('decon', 'struct', 'ser')
('conde', 'struct', 'ser')
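The idea is to loop over all the prefixes and aggregate them while removing them from the word. The tricky part is that the order in which the prefixes are defined can hide a prefix from a single pass: with ['de', 'con'], the 'de' in 'condestructser' only becomes visible after 'con' has been stripped. The iteration therefore has to be repeated until a full pass finds nothing.
The same goes for suffixes, except that the suffix string is built in reverse order: each newly found suffix is prepended.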
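The loop above is greedy and returns a single parse, while the question asks for every possible one. A minimal sketch of one way to enumerate them follows; it is not the code above but a variation on it. The names `prefix_splits`, `suffix_splits`, `all_parses` and the `min_root` parameter are made up for illustration, and the sketch assumes no prefix or suffix is an empty string.

```python
PREFIXES = ['de', 'con']
SUFFIXES = ['er', 's']

def prefix_splits(word):
    """Yield (prefixes, rest) for every way of peeling known prefixes off the front."""
    yield (), word  # peeling nothing is always one option
    for p in PREFIXES:
        if word.startswith(p):
            for more, rest in prefix_splits(word[len(p):]):
                yield (p,) + more, rest

def suffix_splits(word):
    """Yield (rest, suffixes) for every way of peeling known suffixes off the end."""
    yield word, ()
    for s in SUFFIXES:
        if word.endswith(s):
            for rest, more in suffix_splits(word[:-len(s)]):
                yield rest, more + (s,)

def all_parses(word, min_root=1):
    """Return every (prefixes, root, suffixes) whose root has at least min_root letters."""
    parses = []
    for prefixes, rest in prefix_splits(word):
        for root, suffixes in suffix_splits(rest):
            if len(root) >= min_root:
                parses.append((prefixes, root, suffixes))
    return parses

print(all_parses('construct'))
# [((), 'construct', ()), (('con',), 'struct', ())]
```

For 'deconstructs' this yields six candidates, from the unanalysed word up to (('de', 'con'), 'struct', ('s',)); choosing among them is a separate problem.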
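Answer 1 (score: 2)
CodeChords man beat me to it, but since mine returns the prefixes and suffixes as tuples (which may be more or less useful depending on context) and uses recursion, I figured I would post it anyway.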
class Parser():
    PREFIXES = ['de', 'con']
    SUFFIXES = ['er', 's']
    MINIMUM_STEM_LENGTH = 3
    @classmethod
    def prefixes(cls, word, internal=False):
        stem = word
        prefix = None
        for potential_prefix in cls.PREFIXES:
            if word.startswith(potential_prefix):
                prefix = potential_prefix
                stem = word[len(prefix):]
                if len(stem) >= cls.MINIMUM_STEM_LENGTH:
                    break
                else:
                    # stem would be too short: reject this prefix
                    prefix = None
                    stem = word
        if prefix:
            others, stem = cls.prefixes(stem, True)
            others.append(prefix)
            # inner calls collect prefixes innermost-first; reverse once at the top
            return (others, stem) if internal else (reversed(others), stem)
        else:
            return [], stem
    @classmethod
    def suffixes(cls, word):
        suffix = None
        stem = word
        for potential_suffix in cls.SUFFIXES:
            if word.endswith(potential_suffix):
                suffix = potential_suffix
                stem = word[:-len(suffix)]
                if len(stem) >= cls.MINIMUM_STEM_LENGTH:
                    break
                else:
                    # stem would be too short: reject this suffix
                    suffix = None
                    stem = word
        if suffix:
            others, stem = cls.suffixes(stem)
            others.append(suffix)
            return others, stem
        else:
            return [], stem
    @classmethod
    def parse(cls, word):
        prefixes, word = cls.prefixes(word)
        suffixes, word = cls.suffixes(word)
        return (tuple(prefixes), word, tuple(suffixes))
words = ['con', 'deAAers', 'deAAAers', 'construct', 'destructer', 'constructs', 'deconstructs', 'deconstructers']
parser = Parser()
for word in words:
    print(parser.parse(word))
This gives us:
((), 'con', ())
(('de',), 'AAer', ('s',))
(('de',), 'AAA', ('er', 's'))
(('con',), 'struct', ())
(('de',), 'struct', ('er',))
(('con',), 'struct', ('s',))
(('de', 'con'), 'struct', ('s',))
(('de', 'con'), 'struct', ('er', 's'))
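This works by taking the word and using the str.startswith() function to look for a prefix. It does this recursively until it is down to a word with no prefixes, then passes the list of prefixes back up.
It then does the same for suffixes, except that it uses str.endswith(), for obvious reasons.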
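The recursion can also be shown on its own in a few lines. This is only a sketch of the same idea, not a replacement for the class above; `strip_prefixes` and its `min_stem` parameter are names made up for illustration. Prepending each prefix on the way back up keeps them in reading order, so no `reversed()` step is needed.

```python
def strip_prefixes(word, prefixes=('de', 'con'), min_stem=3):
    """Recursively peel known prefixes off the front of word.

    A prefix is accepted only if at least min_stem letters remain after it.
    Returns (list_of_prefixes, stem).
    """
    for p in prefixes:
        rest = word[len(p):]
        if word.startswith(p) and len(rest) >= min_stem:
            found, stem = strip_prefixes(rest, prefixes, min_stem)
            return [p] + found, stem
    # base case: no acceptable prefix left
    return [], word

print(strip_prefixes('deconstructs'))  # (['de', 'con'], 'structs')
print(strip_prefixes('con'))           # ([], 'con')
```

The suffix side would mirror this with `str.endswith()` and `word[:-len(s)]`, appending instead of prepending.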
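Answer 2 (score: 2)
Pyparsing wraps the string indexing and token extraction in its own parsing framework, and lets you build up the parsing definition using simple arithmetic syntax: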
wordlist = ['construct','destructer','constructs','deconstructs']
from pyparsing import StringEnd, oneOf, FollowedBy, Optional, ZeroOrMore, SkipTo
endOfString = StringEnd()
prefix = oneOf("de con")
# a suffix only counts if it sits at the very end of the word
suffix = oneOf("er s") + FollowedBy(endOfString)
word = (ZeroOrMore(prefix)("prefixes") +
        SkipTo(suffix | endOfString)("root") +
        Optional(suffix)("suffix"))
for wd in wordlist:
    print(wd)
    res = word.parseString(wd)
    print(res.dump())
    print(res.prefixes)
    print(res.root)
    print(res.suffix)
    print()
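The results are returned in a rich object called ParseResults, which can be accessed as a simple list, as an object with named attributes, or as a dict. The output of this program is: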
construct
['con', 'struct']
- prefixes: ['con']
- root: struct
['con']
struct
destructer
['de', 'struct', 'er']
- prefixes: ['de']
- root: struct
- suffix: ['er']
['de']
struct
['er']
constructs
['con', 'struct', 's']
- prefixes: ['con']
- root: struct
- suffix: ['s']
['con']
struct
['s']
deconstructs
['de', 'con', 'struct', 's']
- prefixes: ['de', 'con']
- root: struct
- suffix: ['s']
['de', 'con']
struct
['s']
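For comparison, roughly the same grammar (zero or more prefixes, a root, one optional suffix at the end of the word) can be sketched with the standard `re` module. This is not pyparsing and not part of the program above; `WORD_RE`, `PREFIX_RE` and `parse` are names chosen for illustration. Alternatives are sorted longest first so that a longer affix is tried before a shorter one it might contain.

```python
import re

PREFIXES = ['de', 'con']
SUFFIXES = ['er', 's']

def alternation(items):
    """Build a regex alternation, longest alternative first."""
    return '|'.join(re.escape(i) for i in sorted(items, key=len, reverse=True))

PREFIX_RE = re.compile(alternation(PREFIXES))
WORD_RE = re.compile(
    rf'^(?P<prefixes>(?:{alternation(PREFIXES)})*)'  # greedy run of prefixes
    rf'(?P<root>.*?)'                                # shortest possible root
    rf'(?P<suffix>{alternation(SUFFIXES)})?$'        # optional suffix at the end
)

def parse(word):
    """Return (list_of_prefixes, root, suffix) for a single word."""
    m = WORD_RE.match(word)
    prefixes = PREFIX_RE.findall(m.group('prefixes'))
    return prefixes, m.group('root'), m.group('suffix') or ''

for wd in ['construct', 'destructer', 'constructs', 'deconstructs']:
    print(wd, parse(wd))
```

Like the pyparsing version, this returns one greedy parse per word and recognises at most one suffix.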