While parsing text with StanfordNLP, StanfordParser, and Tregex, I want to identify a specific pattern. I can produce the desired output with nltk and nltk.RegexpParser, as shown below.
nltk code:
import nltk
from nltk import word_tokenize, pos_tag

text = ("New developments in the science of motion picture photography are not "
        "abundant at this advanced stage of the game")
# Chunk rule: two adjacent tokens, each tagged NN or JJ
cp_pattern = r"""CP: {<NN|JJ><NN|JJ>}"""
parser = nltk.RegexpParser(cp_pattern)
tree = parser.parse(pos_tag(word_tokenize(text)))
# Print only the chunks labelled CP
for subtree in tree.subtrees():
    if subtree.label() == 'CP':
        print(str(subtree))
And the output:
(CP motion/NN picture/NN)
(CP advanced/JJ stage/NN)
The following code uses StanfordParser to tag and parse text:
from nltk.parse.stanford import StanfordParser
parse_jar = 'path to stanford parser jar'
parse_model = 'path to stanford parser model'
parser_st = StanfordParser(path_to_jar=parse_jar, path_to_models_jar=parse_model)
parse_tree = list(parser_st.raw_parse(text))
print(parse_tree)
Printing parse_tree gives me the following output:
[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NP', [Tree('NNP', ['New']), Tree('NNS', ['developments'])]), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('NN', ['science'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NN', ['motion']), Tree('NN', ['picture']), Tree('NN', ['photography'])])])])])]), Tree('VP', [Tree('VBP', ['are']), Tree('RB', ['not']), Tree('ADJP', [Tree('JJ', ['abundant']), Tree('PP', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('DT', ['this']), Tree('VBN', ['advanced']), Tree('NN', ['stage'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['game'])])])])])])])])])]
Now, how can I define my desired pattern, like cp_pattern, for the StanfordParser output and identify it the same way I did with nltk?
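For reference, here is a minimal sketch of one possible way to do this. It is not necessarily the intended or best approach. It assumes that the first element of parse_tree is an ordinary nltk Tree, so that Tree.pos() can flatten it into (word, tag) pairs, which nltk.RegexpParser accepts just like the output of pos_tag. So that the sketch runs without the Stanford jars, the tree is rebuilt from a bracketed string transcribed from the printed output above; the names stanford_tree, chunker, and matches are mine.

```python
import nltk
from nltk.tree import Tree

# Bracketed transcription of the parse printed above, so the sketch needs no jars.
# With a live parser this would (by assumption) be: stanford_tree = parse_tree[0]
stanford_tree = Tree.fromstring("""
(ROOT (S
  (NP (NP (NNP New) (NNS developments))
      (PP (IN in)
          (NP (NP (DT the) (NN science))
              (PP (IN of) (NP (NN motion) (NN picture) (NN photography))))))
  (VP (VBP are) (RB not)
      (ADJP (JJ abundant)
            (PP (IN at)
                (NP (NP (DT this) (VBN advanced) (NN stage))
                    (PP (IN of) (NP (DT the) (NN game)))))))))
""")

# Flatten the constituency tree into (word, tag) pairs using the parser's own tags
tagged = stanford_tree.pos()

# Same chunk rule as in the nltk example: two adjacent NN or JJ tokens
cp_pattern = r"CP: {<NN|JJ><NN|JJ>}"
chunker = nltk.RegexpParser(cp_pattern)
chunked = chunker.parse(tagged)

# Collect the (word, tag) pairs of every CP chunk
matches = [st.leaves() for st in chunked.subtrees() if st.label() == "CP"]
for match in matches:
    print(match)
```

Note that the result can differ from the pos_tag version because the tags differ: in the parse above, "advanced" is tagged VBN rather than JJ, so "advanced stage" does not match this rule. If that chunk is wanted, the rule would have to be widened, for example by also allowing VBN in the first position.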