Context-free grammar with feature structures in Python

Date: 2018-01-02 01:42:28

Tags: python-3.x nlp nltk context-free-grammar

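I am trying to generate sentences in Python from a grammar I have defined. To avoid agreement problems I used feature structures.

This is the code I have so far: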

from __future__ import print_function
import nltk
from nltk.featstruct import FeatStruct
from nltk import grammar, parse
from nltk.parse.generate import generate
from nltk import CFG

g = """
% start DP
DP -> D[AGR=[NUM='sg', PERS=3, GND='m']] N[AGR=[NUM='sg', GND='m']]
D[AGR=[NUM='sg', PERS=3, GND='f']] -> 'une' | 'la'
D[AGR=[NUM='sg', PERS=3, GND='m']] -> 'un' | 'le'
D[AGR=[NUM='pl', PERS=3]] -> 'des' | 'les'
N[AGR=[NUM='sg', GND='m']] -> 'garçon'
N[AGR=[NUM='pl', GND='m']] -> 'garçons'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
"""

# Build the feature grammar from the string above
# (assumed: this step is missing from the post as written).
grammar = grammar.FeatureGrammar.fromstring(g)

for sentence in generate(grammar, n=30):
    print(''.join(sentence))

This is the output I get:

unegarçon
unegarçons
unefille
unefilles
lagarçon
lagarçons
lafille
lafilles
ungarçon
ungarçons
unfille
unfilles
legarçon
legarçons
lefille
lefilles
desgarçon
desgarçons
desfille
desfilles
lesgarçon
lesgarçons
lesfille
lesfilles

whereas I should get output like this:

un garçon
le garçon

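The problems I have are:

  1. The agreement is not working: my sentences violate agreement.

  2. There is no space between the two words of the sentence.

  3. What am I not seeing?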

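1 answer:

Answer 0 (score: 2)

Let's tackle the easy part of the question first.

Q2. There is no space between the two words of the sentence.

You were close on the printing side =)

The problem lies in how you are using the str.join function.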

>>> list_of_str = ['a', 'b', 'c']
>>> ''.join(list_of_str)
'abc'
>>> ' '.join(list_of_str)
'a b c'
>>> '|'.join(list_of_str)
'a|b|c'
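
Applied to the loop in the question, only the separator has to change. A minimal sketch, using a hard-coded list of token lists as a stand-in for what generate() yields (the `sentences` and `lines` names are just for illustration):

```python
# Stand-in for the generator output: each sentence is a list of tokens.
sentences = [['un', 'garçon'], ['le', 'garçon']]

# Join on a single space instead of the empty string.
lines = [' '.join(tokens) for tokens in sentences]

for line in lines:
    print(line)
```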

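Q1. The agreement is not working: my sentences violate agreement.

The first warning sign

To write a feature-structure grammar with agreement, there should be a rule whose right-hand side (RHS) contains something like D[AGR=?a] N[AGR=?a], e.g.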

NP -> D[AGR=?a] N[AGR=?a] 

If that is missing, the grammar has no real agreement rule; see http://www.nltk.org/howto/featgram.html
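
The shared variable ?a works because both AGR values have to unify into a single feature structure. Here is a small sketch of that check using nltk.featstruct.FeatStruct directly. The variable names are made up for illustration, and it relies on unify returning None when two structures clash:

```python
from nltk.featstruct import FeatStruct

masc_sg = FeatStruct(NUM='sg', GND='m')  # e.g. the AGR value of 'un'
fem_sg = FeatStruct(NUM='sg', GND='f')   # e.g. the AGR value of 'fille'
sg_only = FeatStruct(NUM='sg')           # underspecified: gender left open

# Conflicting GND values cannot be unified, so the result is None.
clash = masc_sg.unify(fem_sg)

# An underspecified structure unifies with a more specific one.
merged = sg_only.unify(masc_sg)

print(clash)
print(merged)
```

A rule such as DP -> D[AGR=?a] N[AGR=?a] asks for exactly this kind of compatibility between the determiner and the noun.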

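Now comes the gotcha!

If we look carefully at the nltk.parse.generate code, it merely yields every possible combination of terminals, and it does not seem to care about feature structures: https://github.com/nltk/nltk/blob/develop/nltk/parse/generate.py

(I don't think this is an intended feature, so it would be good to raise an issue in the NLTK repository.)

So if we run the following, it prints every possible combination of terminals (without respecting agreement):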

from nltk import grammar, parse
from nltk.parse.generate import generate

# If person is always 3rd, we can skip the PERSON feature.
g = """
DP -> D[AGR=?a] N[AGR=?a] 
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

"""

grammar =  grammar.FeatureGrammar.fromstring(g)

print(list(generate(grammar, n=30)))

[OUT]:

[['un', 'garcon'], ['un', 'fille'], ['une', 'garcon'], ['une', 'fille']]
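
That output is exactly the Cartesian product of the D and N entries. The following toy model shows why feature-blind generation behaves this way. It is a simplification for illustration, not NLTK's actual implementation, and `lexicon` and `generate_blind` are made-up names:

```python
from itertools import product

# Words indexed by bare category name; the AGR features are deliberately dropped.
lexicon = {
    'D': ['un', 'une'],
    'N': ['garcon', 'fille'],
}

def generate_blind(rhs):
    """Yield every token sequence for a right-hand side given as category names."""
    for combo in product(*(lexicon[cat] for cat in rhs)):
        yield list(combo)

candidates = list(generate_blind(['D', 'N']))
print(candidates)
```

Once the lookup key is only the category, nothing is left to rule out 'un fille' or 'une garcon'.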

But if we try to parse valid and invalid sentences, the agreement rule kicks in:

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a] 
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

"""

grammar =  grammar.FeatureGrammar.fromstring(g)

parser = parse.FeatureEarleyChartParser(grammar)

trees = parser.parse('une garcon'.split()) # Invalid sentence.
print ("Parses for 'une garcon':", list(trees)) 

trees = parser.parse('un garcon'.split()) # Valid sentence.
print ("Parses for 'un garcon':", list(trees)) 

[OUT]:

Parses for 'une garcon': []
Parses for 'un garcon': [Tree(DP[], [Tree(D[AGR=[GND='m', NUM='sg']], ['un']), Tree(N[AGR=[GND='m', NUM='sg']], ['garcon'])])]
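
The same check can be wrapped in a small predicate. This is only a sketch: `is_grammatical` and `french_grammar` are hypothetical names, and it assumes every token is covered by the grammar (as far as I know, NLTK's chart parsers raise a ValueError for unknown words).

```python
from nltk import grammar, parse

g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
"""

french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

def is_grammatical(tokens):
    """Return True if the parser finds at least one tree for the tokens."""
    # parse() returns an iterator; next(..., None) avoids a StopIteration.
    return next(parser.parse(tokens), None) is not None

results = {s: is_grammatical(s.split()) for s in ['un garcon', 'une garcon']}
print(results)
```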

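To enforce the agreement rule at generation time, one possible solution is to parse each generated sequence and keep only the ones that can be parsed, e.g.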

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a] 
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

"""

grammar =  grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)

for tokens in list(generate(grammar, n=30)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        print(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

[OUT]:

un garcon
une fille
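
The n argument caps the number of candidates before any filtering happens, so it has to be guessed generously. One alternative is to filter lazily and cut off after the filter instead. A sketch under that assumption, where `generate_agreeing` is a made-up helper name:

```python
from itertools import islice

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
"""

french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

def generate_agreeing(fgrammar, fparser):
    """Lazily yield generated token lists that the feature parser accepts."""
    for tokens in generate(fgrammar):  # no n: walk through every candidate
        if next(fparser.parse(tokens), None) is not None:
            yield tokens

# Take the first two agreeing phrases, however many candidates that needs.
first_two = [' '.join(t) for t in islice(generate_agreeing(french_grammar, parser), 2)]
print(first_two)
```

This only terminates on its own because the grammar is non-recursive; for a recursive grammar you would still want to pass depth or n to generate.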

My goal is to produce the last column:

[image of the table not available]

Without prepositions:

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a] 

N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'


"""

grammar =  grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(grammar)

valid_productions = set()

for tokens in list(generate(grammar, n=30)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[OUT]:

la fille
le garcon
les filles
les garcons
un garcon
une fille
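
Collecting the results in a set matters here. As far as I can tell, 'les' is listed twice (once per gender), so the same surface string is generated more than once. A sketch that compares the number of raw candidates, accepted candidates and distinct phrases (variable names are illustrative):

```python
from nltk import grammar, parse
from nltk.parse.generate import generate

g = """
DP -> D[AGR=?a] N[AGR=?a]
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'
N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'
D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'
D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'
D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'
"""

french_grammar = grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

# Every determiner entry paired with every noun, agreement ignored.
candidates = [' '.join(tokens) for tokens in generate(french_grammar)]

# Keep the candidates for which the feature parser finds a tree.
valid = [s for s in candidates if next(parser.parse(s.split()), None) is not None]

print(len(candidates), len(valid), len(set(valid)))
```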

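Now, to include prepositions

The TOP (aka START) of the grammar must have more than one branch. Currently the DP -> D[AGR=?a] N[AGR=?a] rule is at the TOP. To allow PP constructions, we need something like PHRASE -> DP | PP, with the PHRASE non-terminal as the new TOP, e.g.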

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """

PHRASE -> DP | PP 

DP -> D[AGR=?a] N[AGR=?a] 
PP -> P[AGR=?a] N[AGR=?a] 

P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'

N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m']] -> 'un'
D[AGR=[NUM='sg', GND='f']] -> 'une'

D[AGR=[NUM='sg', GND='m']] -> 'le'
D[AGR=[NUM='sg', GND='f']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'

"""

french_grammar =  grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

valid_productions = set()

for tokens in list(generate(french_grammar, n=100)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[OUT]:

au garcon
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
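
PHRASE becomes the TOP here simply because its rule comes first and the grammar string has no "% start" directive. A quick way to confirm which symbol the grammar starts from, sketched on a cut-down grammar (reading the category through the TYPE feature is my assumption about how FeatStructNonterminal stores its name):

```python
from nltk import grammar
from nltk.featstruct import TYPE

g = """
PHRASE -> DP | PP
DP -> D[AGR=?a] N[AGR=?a]
PP -> P[AGR=?a] N[AGR=?a]
P[AGR=[NUM='sg', GND='m']] -> 'du'
D[AGR=[NUM='sg', GND='m']] -> 'un'
N[AGR=[NUM='sg', GND='m']] -> 'garcon'
"""

french_grammar = grammar.FeatureGrammar.fromstring(g)

# With no "% start" line, the left-hand side of the first rule is the start symbol.
start_type = french_grammar.start()[TYPE]
print(start_type)
```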

To get everything in the table:

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """

PHRASE -> DP | PP 

DP -> D[AGR=?a] N[AGR=?a] 
PP -> P[AGR=[GND='m', NUM='sg']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg', DEF='d']] N[AGR=[GND='f', NUM='sg']]
PP -> P[AGR=[GND=?a, NUM='pl']] N[AGR=[GND=?a, NUM='pl']]


P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
P[AGR=[NUM='sg', GND='f']] -> 'de' | 'à'
P[AGR=[NUM='pl']] -> 'des' | 'aux'


N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m', DEF='i']] -> 'un'
D[AGR=[NUM='sg', GND='f', DEF='i']] -> 'une'

D[AGR=[NUM='sg', GND='m', DEF='d']] -> 'le'
D[AGR=[NUM='sg', GND='f', DEF='d']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'



"""

french_grammar =  grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

valid_productions = set()

for tokens in list(generate(french_grammar, n=100000)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[OUT]:

au garcon
aux filles
aux garcons
de la fille
des filles
des garcons
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
à la fille

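Beyond the table

It is also possible to produce de|à un(e) garcon|fille, i.e.

  • de un garcon
  • de une fille
  • à un garcon
  • à une fille

I'm not sure whether these are valid French phrases, but if they are, you can underspecify the feminine singular PP rule by removing the DEF feature, changing: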

PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg', DEF='d']] N[AGR=[GND='f', NUM='sg']]

to:

PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg']] N[AGR=[GND='f', NUM='sg']]

Then add an extra rule to produce the masculine singular indefinite PP:

PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='m', NUM='sg', DEF='i']] N[AGR=[GND='m', NUM='sg']]
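
The effect of dropping DEF from the rule can be checked with plain feature-structure unification. A sketch, with the AGR values copied from the grammar and illustrative variable names:

```python
from nltk.featstruct import FeatStruct

# AGR values of the feminine singular determiners in the grammar.
determiners = [
    ('une', FeatStruct(NUM='sg', GND='f', DEF='i')),
    ('la', FeatStruct(NUM='sg', GND='f', DEF='d')),
]

slot_with_def = FeatStruct(NUM='sg', GND='f', DEF='d')  # D slot in the original PP rule
slot_without_def = FeatStruct(NUM='sg', GND='f')        # D slot once DEF is removed

# A determiner fits a slot when the two structures unify (result is not None).
accepted_before = [w for w, fs in determiners if slot_with_def.unify(fs) is not None]
accepted_after = [w for w, fs in determiners if slot_without_def.unify(fs) is not None]

print(accepted_before)
print(accepted_after)
```

With DEF='d' in the slot only 'la' fits; without it, both 'une' and 'la' fit, which is what lets 'de une fille' through.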

TL;DR

from nltk import grammar, parse
from nltk.parse.generate import generate

g = """

PHRASE -> DP | PP 

DP -> D[AGR=?a] N[AGR=?a] 
PP -> P[AGR=[GND='m', NUM='sg']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='f', NUM='sg']] N[AGR=[GND='f', NUM='sg']]
PP -> P[AGR=[GND='f', NUM='sg']] D[AGR=[GND='m', NUM='sg', DEF='i']] N[AGR=[GND='m', NUM='sg']]
PP -> P[AGR=[GND=?a, NUM='pl']] N[AGR=[GND=?a, NUM='pl']]


P[AGR=[NUM='sg', GND='m']] -> 'du' | 'au'
P[AGR=[NUM='sg', GND='f']] -> 'de' | 'à'
P[AGR=[NUM='pl']] -> 'des' | 'aux'


N[AGR=[NUM='sg', GND='m']] -> 'garcon'
N[AGR=[NUM='sg', GND='f']] -> 'fille'

N[AGR=[NUM='pl', GND='m']] -> 'garcons'
N[AGR=[NUM='pl', GND='f']] -> 'filles'

D[AGR=[NUM='sg', GND='m', DEF='i']] -> 'un'
D[AGR=[NUM='sg', GND='f', DEF='i']] -> 'une'

D[AGR=[NUM='sg', GND='m', DEF='d']] -> 'le'
D[AGR=[NUM='sg', GND='f', DEF='d']] -> 'la'

D[AGR=[NUM='pl', GND='m']] -> 'les'
D[AGR=[NUM='pl', GND='f']] -> 'les'



"""

french_grammar =  grammar.FeatureGrammar.fromstring(g)
parser = parse.FeatureEarleyChartParser(french_grammar)

valid_productions = set()

for tokens in list(generate(french_grammar, n=100000)):
    parsed_tokens = parser.parse(tokens)
    try: 
        first_parse = next(parsed_tokens) # Check if there's a valid parse.
        valid_productions.add(' '.join(first_parse.leaves()))
    except StopIteration:
        continue

for np in sorted(valid_productions):
    print(np)

[OUT]:

au garcon
aux filles
aux garcons
de la fille
de un garcon
de une fille
des filles
des garcons
du garcon
la fille
le garcon
les filles
les garcons
un garcon
une fille
à la fille
à un garcon
à une fille