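I am trying to extract phrases from my corpus. I have defined two rules: one is a noun followed by one or more nouns, and the other is an adjective followed by a noun. If the same phrase is extracted by both rules, I want the program to ignore the second one. The problem I am facing is that phrases are only extracted by the first rule; the second rule is not applied. Here is the code: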
PATTERN = r"""
NP: {<NN><NN>+}
{<ADJ><NN>*}
"""
MIN_FREQ = 1
MIN_CVAL = -13  # lowest cval -13

def __init__(self):
    corpus_root = os.path.abspath('../multiwords/test')
    self.corpus = nltk.corpus.reader.TaggedCorpusReader(corpus_root, '.*')
    self.word_count_by_document = None
    self.phrase_frequencies = None

def calculate_phrase_frequencies(self):
    """
    extract the sentence chunks according to PATTERN and calculate
    the frequency of chunks with pos tags
    """
    # pdb.set_trace()
    chunk_freq_dict = defaultdict(int)
    chunker = nltk.RegexpParser(self.PATTERN)
    for sent in self.corpus.tagged_sents():
        sent = [s for s in sent if s[1] is not None]
        for chk in chunker.parse(sent).subtrees():
            if str(chk).startswith('(NP'):
                phrase = chk.__unicode__()[4:-1]
                if '\n' in phrase:
                    phrase = ' '.join(phrase.split())
                just_phrase = ' '.join([w.rsplit('/', 1)[0] for w in phrase.split(' ')])
                # print(just_phrase)
                chunk_freq_dict[just_phrase] += 1
    self.phrase_frequencies = chunk_freq_dict
    # print(self.phrase_frequencies)
Answer 0 (score: 2)
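First, Python, and multi-line strings in particular, is sensitive to indentation. Make sure there are no leading spaces inside the string (they would be treated as characters), and make sure the patterns (the braces) line up visually.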
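For NLTK grammars specifically, the indentation seems to be mostly a readability concern: as far as I can tell, `nltk.RegexpParser` strips leading and trailing whitespace from each grammar line before reading it. A minimal check under that assumption (both grammar strings and the sample sentence are made up for illustration):

```python
import nltk

# The same two rules, written flush-left and indented (illustrative strings).
flat = r"""
NP: {<NN><NN>+}
{<ADJ><NN>+}
"""
indented = r"""
    NP: {<NN><NN>+}
        {<ADJ><NN>+}
"""

# Hypothetical tagged sentence using the ADJ/NN tags from the question.
sentence = [('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

flat_tree = nltk.RegexpParser(flat).parse(sentence)
indented_tree = nltk.RegexpParser(indented).parse(sentence)

# If whitespace inside the grammar mattered, these two trees would differ.
same_result = flat_tree == indented_tree
print(same_result)
```

If the two trees compare equal on your installation, whitespace in the grammar is not what keeps the second rule from firing, and the pattern itself (or the tag names in the corpus) is the more likely cause.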
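Also, I think you probably want <ADJ><NN>+ as the second pattern: + means one or more, while * means zero or more. With *, the rule also matches an adjective that has no noun after it.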
I hope this solves the problem.
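To make the difference between the two quantifiers concrete, here is a small sketch that runs both variants of the second rule on one sentence. The helper `np_chunks` and the sample sentence are hypothetical, introduced only for this illustration; the first rule is the one from the question.

```python
import nltk

def np_chunks(second_rule, tagged_sentence):
    """Return the NP chunks found by the question's first rule plus `second_rule`."""
    grammar = "NP: {<NN><NN>+}\n" + second_rule
    tree = nltk.RegexpParser(grammar).parse(tagged_sentence)
    return [
        ' '.join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == 'NP'
    ]

# Hypothetical tagged sentence, using the same ADJ/NN tags as the question.
sentence = [('a', 'DT'), ('big', 'ADJ'), ('red', 'ADJ'), ('ball', 'NN')]

with_star = np_chunks("{<ADJ><NN>*}", sentence)  # ADJ followed by zero or more NN
with_plus = np_chunks("{<ADJ><NN>+}", sentence)  # ADJ followed by one or more NN

print(with_star)
print(with_plus)
```

On this sentence the `*` variant should also chunk the lone adjective "big", while the `+` variant keeps only "red ball".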
#!/usr/bin/env python
import nltk

PATTERN = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""

sentence = [('the', 'DT'), ('little', 'ADJ'), ('yellow', 'ADJ'),
            ('shepherd', 'NN'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
            ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

cp = nltk.RegexpParser(PATTERN)
print(cp.parse(sentence))
Result:
(S
  the/DT
  little/ADJ
  yellow/ADJ
  (NP shepherd/NN dog/NN)
  barked/VBD
  at/IN
  the/DT
  (NP silly/ADJ cat/NN))
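For the frequency count in the question, it may be simpler to read the words from each NP subtree's leaves than to slice the tree's string form. Below is a minimal sketch under that assumption; `count_phrases` is a hypothetical helper name, and the sentence is the one from the example above. As far as I can tell, the rules within one grammar stage are applied in order and a later rule only sees tokens that are not already inside a chunk, so the same token span should not be counted twice.

```python
from collections import Counter

import nltk

PATTERN = r"""
NP: {<NN><NN>+}
    {<ADJ><NN>+}
"""

def count_phrases(tagged_sents, pattern=PATTERN):
    """Count NP chunk phrases (words only, tags dropped) over tagged sentences."""
    chunker = nltk.RegexpParser(pattern)
    counts = Counter()
    for sent in tagged_sents:
        tree = chunker.parse(sent)
        # Visit only the subtrees labelled NP; the root S is skipped.
        for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
            phrase = ' '.join(word for word, tag in subtree.leaves())
            counts[phrase] += 1
    return counts

sentence = [('the', 'DT'), ('little', 'ADJ'), ('yellow', 'ADJ'),
            ('shepherd', 'NN'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'),
            ('the', 'DT'), ('silly', 'ADJ'), ('cat', 'NN')]

phrase_counts = count_phrases([sentence])
print(phrase_counts)
```

In the question's class, the same loop could run over `self.corpus.tagged_sents()` after filtering out tokens whose tag is `None`.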