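I'm trying to generate sentences from a defined grammar with Python. To avoid agreement problems I used feature structures.
This is the code I have so far: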
>>> from __future__ import print_function
>>> import nltk
>>> from nltk.featstruct import FeatStruct
>>> from nltk import grammar, parse
>>> from nltk.parse.generate import generate
>>> from nltk import CFG
>>> g = """
% start DP
DP-> D[AGR=[NUM='sg', PERS=3, GND='m']] N[AGR=[NUM='sg', GND='m']]
D[AGR=[NUM='sg', PERS=3, GND='f']] -> 'une' | 'la'
D[AGR=[NUM='sg', PERS=3, GND='m']] -> 'un' | 'le'
D[AGR=[NUM='pl', PERS=3]] -> 'des' | 'les'
N[AGR=[NUM='sg', GND='m']] -> 'garçon'
N[AGR=[NUM='pl', GND='m']] -> 'garçons'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
"""
>>> for sentence in generate(grammar, n=30):
...     print(''.join(sentence))
This is the output I get:
unegarçon
unegarçons
unefille
unefilles
lagarçon
lagarçons
lafille
lafilles
ungarçon
ungarçons
unfille
unfilles
legarçon
legarçons
lefille
lefilles
desgarçon
desgarçons
desfille
desfilles
lesgarçon
lesgarçons
lesfille
lesfilles
whereas I expected output like this:
un garçon
le garçon
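The problems I have are:
Agreement is not working: the generated sentences do not respect agreement.
There is no space between the two words of a sentence.
What am I missing?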
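Answer 0 (score: 2)
Let's tackle the easy part of the question first.
You're close on the printing part =)
The problem is how you're using the str.join function: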
>>> list_of_str = ['a', 'b', 'c']
>>> ''.join(list_of_str)
'abc'
>>> ' '.join(list_of_str)
'a b c'
>>> '|'.join(list_of_str)
'a|b|c'
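Applied to the question, each item yielded by generate is a list of tokens, so joining on a single space gives the expected spacing. A minimal sketch, with a hard-coded list standing in for the generator's output (the `sentences` name is only for illustration):

```python
# Stand-in for what nltk.parse.generate.generate yields: one token list per sentence.
sentences = [['un', 'garçon'], ['le', 'garçon']]

# Join on a space rather than on the empty string.
lines = [' '.join(tokens) for tokens in sentences]
for line in lines:
    print(line)
```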
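To generate from a feature structure grammar with agreement, there should be a rule whose right-hand side (RHS) contains D[AGR=?a] N[AGR=?a], e.g.
NP -> D[AGR=?a] N[AGR=?a]
Without it, the grammar has no real agreement rule; see http://www.nltk.org/howto/featgram.html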
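The shared variable ?a means that both AGR values must unify, i.e. merge without any conflicting feature. A rough pure-Python picture of that idea, assuming flat feature dictionaries; the `unify` helper below is hypothetical and much simpler than NLTK's own FeatStruct unification:

```python
def unify(left, right):
    """Merge two flat feature dicts; return None if any shared feature clashes."""
    merged = dict(left)
    for name, value in right.items():
        if name in merged and merged[name] != value:
            return None  # conflicting values: unification fails
        merged[name] = value
    return merged


un = {'NUM': 'sg', 'GND': 'm'}
une = {'NUM': 'sg', 'GND': 'f'}
garcon = {'NUM': 'sg', 'GND': 'm'}
plural = {'NUM': 'pl'}  # underspecified: says nothing about GND

print(unify(un, garcon))    # features agree, so a merged dict comes back
print(unify(une, garcon))   # GND clashes, so the result is None
print(unify(plural, {'NUM': 'pl', 'GND': 'f'}))  # underspecified side is filled in
```

An underspecified entry such as `plural` unifies with either gender, which is why a determiner marked only for number can combine with both masculine and feminine plural nouns.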
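Looking closely at the nltk.parse.generate code, it just yields every possible combination of terminals and does not seem to take the feature structures into account: https://github.com/nltk/nltk/blob/develop/nltk/parse/generate.py
(I don't think that's a feature, so it would be good to raise an issue on the NLTK repository.)
So if we do the following, it prints all possible combinations of terminals, regardless of agreement: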
from nltk import grammar, parse
from nltk.parse.generate import generate
# If person is always 3rd, we can skip the PERSON feature.
g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
"""
grammar = grammar.FeatureGrammar.fromstring(g)
print(list(generate(grammar, n=30)))
[OUT]:
[['un', 'garcon'], ['un', 'fille'], ['une', 'garcon'], ['une', 'fille']]
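One way to picture this result: if the features are ignored and only the bare category names D and N are matched, expanding DP amounts to a Cartesian product of the two word lists. This is only an illustration with a hypothetical `lexicon` dict, not NLTK's actual implementation:

```python
from itertools import product

# Lexicon keyed by bare category; the AGR features are deliberately dropped.
lexicon = {
    'D': ['un', 'une'],
    'N': ['garcon', 'fille'],
}

# DP -> D N, expanded without any agreement check.
combinations = [list(pair) for pair in product(lexicon['D'], lexicon['N'])]
print(combinations)
```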
# Parsing, unlike generate, does enforce the agreement rule.
from nltk import grammar, parse
from nltk.parse.generate import generate
g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
"""
grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)
trees = parser.parse('une garcon'.split()) # Invalid sentence.
print ("Parses for 'une garcon':", list(trees))
trees = parser.parse('un garcon'.split()) # Valid sentence.
print ("Parses for 'un garcon':", list(trees))
[OUT]:
Parses for 'une garcon': []
Parses for 'un garcon': [Tree(DP[], [Tree(D[AGR=[GND='m', NUM='sg']], ['un']), Tree(N[AGR=[GND='m', NUM='sg']], ['garcon'])])]
# Generate every terminal combination, then keep only those that parse.
from nltk import grammar, parse
from nltk.parse.generate import generate
g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
"""
grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)
for tokens in list(generate(grammar, n=30)):
    parsed_tokens = parser.parse(tokens)
    try:
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        print(' '.join(first_parse.leaves()))
    except StopIteration:
        continue
[OUT]:
un garcon
une fille
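The try/except StopIteration pattern can also be written with the two-argument form of the built-in next, which returns a default value instead of raising. A small sketch of that variant, using a hypothetical `fake_parse` function in place of parser.parse so that it runs without NLTK:

```python
def fake_parse(tokens):
    """Hypothetical stand-in for parser.parse: yields a parse only when D and N agree."""
    agreeing = {('un', 'garcon'), ('une', 'fille')}
    if tuple(tokens) in agreeing:
        yield tokens


candidates = [['un', 'garcon'], ['un', 'fille'], ['une', 'garcon'], ['une', 'fille']]

valid = []
for tokens in candidates:
    first_parse = next(fake_parse(tokens), None)  # None when there is no parse
    if first_parse is not None:
        valid.append(' '.join(first_parse))

print(valid)
```

With the real parser, the joined string would come from first_parse.leaves() as in the code above.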
# Same approach with plurals and definite articles added; a set removes duplicates.
from nltk import grammar, parse
from nltk.parse.generate import generate
g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'
D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'
"""
grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)
valid_productions = set()
for tokens in list(generate(grammar, n=30)):
    parsed_tokens = parser.parse(tokens)
    try:
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue
for np in sorted(valid_productions):
    print(np)
[OUT]:
la fille
le garcon
les filles
les garcons
un garcon
une fille
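A set is used here because 'les' is listed twice in the lexicon (once per gender), so the same phrase can be produced more than once. If the original generation order matters, dict.fromkeys is one way to deduplicate while keeping first-seen order. The `phrases` list below is made up for illustration:

```python
# Made-up list of joined phrases containing repeats, as could happen with 'les'.
phrases = ['les garcons', 'les filles', 'les garcons', 'les filles', 'un garcon']

unique_sorted = sorted(set(phrases))            # alphabetical, as in the answer
unique_in_order = list(dict.fromkeys(phrases))  # keeps first-seen order

print(unique_sorted)
print(unique_in_order)
```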
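The TOP (aka START) of the grammar needs more than one branch. Currently the DP -> D[AGR=?a] N[AGR=?a] rule is at the TOP. To allow PP constructions, we need something like PHRASE -> DP | PP, with the PHRASE non-terminal set as the new TOP, e.g.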
# PHRASE is the first rule, so it becomes the start symbol.
from nltk import grammar, parse
from nltk.parse.generate import generate
g = """
PHRASE -> DP | PP
DP -> D[AGR=?a] N[AGR=?a]
PP -> P[AGR=?a] N[AGR=?a]
P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'
D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'
"""
french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)
valid_productions = set()
for tokens in list(generate(french_grammar, n=100)):
    parsed_tokens = parser.parse(tokens)
    try:
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue
for np in sorted(valid_productions):
    print(np)
[OUT]:
au garcon
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
# Separate PP rules per gender and number; DEF marks definite ('d') vs indefinite ('i') articles.
from nltk import grammar, parse
from nltk.parse.generate import generate
g = """
PHRASE -> DP | PP
DP -> D[AGR=?a] N[AGR=?a]
PP -> P[AGR=[GND='m', NUM='sg']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg', DEF='d']] N[AGR=[GND='f', NUM='sg']]
PP -> P[AGR=[GND=?a, NUM='pl']] N[AGR=[GND=?a, NUM='pl']]
P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
P[AGR=[NUM='sg', GND='f']] -> 'de' | 'à'
P[AGR=[NUM='pl']] -> 'des' | 'aux'
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
D[AGR=[NUM='sg', GND='m', DEF='i']] -> 'un'
D[AGR=[NUM='sg', GND='f', DEF='i']] -> 'une'
D[AGR=[NUM='sg', GND='m', DEF='d']] -> 'le'
D[AGR=[NUM='sg', GND='f', DEF='d']] -> 'la'
D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'
"""
french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)
valid_productions = set()
for tokens in list(generate(french_grammar, n=100000)):
    parsed_tokens = parser.parse(tokens)
    try:
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue
for np in sorted(valid_productions):
    print(np)
[OUT]:
au garcon
aux filles
aux garcons
de la fille
des filles
des garcons
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
à la fille
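It's also possible to produce de|à un(e) garcon|fille, i.e. de un garcon, de une fille, à un garcon and à une fille.
I'm not sure whether those are valid French phrases, but if they are, you can relax the feminine singular PP rule by dropping the DEF feature, changing: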
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg', DEF='d']] N[AGR=[GND='f', NUM='sg']]
to:
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg']] N[AGR=[GND='f', NUM='sg']]
Then add an extra rule to produce masculine singular indefinite PPs:
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='m', NUM='sg', DEF='i']] N[AGR=[GND='m', NUM='sg']]
# Feminine singular PP rule without DEF, plus the extra rule for masculine indefinite PPs.
from nltk import grammar, parse
from nltk.parse.generate import generate
g = """
PHRASE -> DP | PP
DP -> D[AGR=?a] N[AGR=?a]
PP -> P[AGR=[GND='m', NUM='sg']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg']] N[AGR=[GND='f', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='m', NUM='sg', DEF='i']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND=?a, NUM='pl']] N[AGR=[GND=?a, NUM='pl']]
P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
P[AGR=[NUM='sg', GND='f']] -> 'de' | 'à'
P[AGR=[NUM='pl']] -> 'des' | 'aux'
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
D[AGR=[NUM='sg', GND='m', DEF='i']] -> 'un'
D[AGR=[NUM='sg', GND='f', DEF='i']] -> 'une'
D[AGR=[NUM='sg', GND='m', DEF='d']] -> 'le'
D[AGR=[NUM='sg', GND='f', DEF='d']] -> 'la'
D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'
"""
french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)
valid_productions = set()
for tokens in list(generate(french_grammar, n=100000)):
    parsed_tokens = parser.parse(tokens)
    try:
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue
for np in sorted(valid_productions):
    print(np)
[OUT]:
au garcon
aux filles
aux garcons
de la fille
de un garcon
de une fille
des filles
des garcons
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
à la fille
à un garcon
à une fille