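I'm playing with the Brown corpus, specifically the tagged sentences in the "news" category. I found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I'm trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences don't need to be "cleaned" of tags; they just need to contain "to" with each of the associated POS tags.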
import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print(sent)
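I tried the code above with just one of the POS tags to see whether it works, but it prints every matching instance. I need it to print only the first sentence that contains the matching word and tag, and then stop. I tried this: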
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print(sent)
        if (word != 'to' and tag != 'IN'):
            break
This works for this POS tag because it is the first one associated with "to", but if I use:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break
it returns nothing. I think I'm close. Care to help?
Answer 0 (score: 2)
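You can keep adding to your current code, but it does not account for cases such as "to" occurring more than once in a sentence with the same POS tag, or a capitalized "To".
If you want to stick with your code, try this: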
from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    # Yield each sentence in which 'to' carries the given POS tag
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print(sent)

for sent in to_pos_sent('IN'):
    print(sent)
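Since the question asks for only one sentence per tag, a generator like this can be consumed with `next()` so that it stops at the first match. The following is a minimal sketch, not the answer's own code: it assumes a small hand-made `toy_sents` list in place of the Brown corpus (the sentences are invented), passes the sentences in as an argument, and uses a hypothetical helper name `first_to_sent`.

```python
# Toy stand-in for brown.tagged_sents(categories="news"); invented data.
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN'), ('.', '.')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB'), ('.', '.')],
    [('They', 'PPSS'), ('drove', 'VBD'), ('to', 'IN'), ('work', 'NN'), ('.', '.')],
]

def to_pos_sent(sents, pos):
    """Yield every sentence that contains 'to' tagged as `pos`."""
    for sent in sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent
                break  # move on, so the same sentence is not yielded twice

def first_to_sent(sents, pos):
    """Return the first matching sentence, or None if there is none."""
    return next(to_pos_sent(sents, pos), None)

for pos in ['TO', 'IN', 'TO-HL']:
    print(pos, first_to_sent(toy_sents, pos))
```

Because generators are lazy, `next()` stops scanning as soon as one sentence has been found; the `None` default avoids a `StopIteration` for tags that never occur.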
I suggest storing the sentences in a defaultdict(list), so that you can retrieve them at any time.
from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)
to_counts = Counter()

for sent in brown.tagged_sents(categories='news'):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower() == 'to':
                # Store the tagged sentence under this POS tag
                sents_with_to[pos].append(sent)
                to_counts[pos] += 1

for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print(pos, sent)
To access the sentences for a specific POS:
for sent in sents_with_to['TO']:
    print(sent)
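If only one example sentence per tag is needed, as in the question, the same single pass can keep just the first sentence seen for each tag. This is a sketch under assumptions: `toy_sents` is an invented stand-in for the corpus, and `first_sent_by_tag` is a hypothetical name.

```python
# Toy stand-in for brown.tagged_sents(categories="news"); invented data.
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN'), ('.', '.')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB'), ('.', '.')],
    [('They', 'PPSS'), ('drove', 'VBD'), ('to', 'IN'), ('work', 'NN'), ('.', '.')],
]

first_sent_by_tag = {}
for sent in toy_sents:
    for word, tag in sent:
        if word.lower() == 'to':
            # setdefault stores the sentence only if this tag has no entry yet
            first_sent_by_tag.setdefault(tag, sent)

for tag, sent in first_sent_by_tag.items():
    print(tag, sent)
```

The third sentence also has "to" tagged IN, but it is ignored because an IN example was already recorded.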
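You'll notice that if "to" with a particular POS occurs twice in a sentence, that sentence is recorded twice in sents_with_to[pos]. If you want to remove the duplicates, try: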
# Flatten each sentence into a "word#pos word#pos ..." string so it is hashable
sents_with_to_and_TO = set(" ".join("#".join((word, pos)) for word, pos in sent) for sent in sents_with_to['TO'])
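An alternative to flattening everything into strings is to convert each sentence to a tuple, which is hashable, and skip sentences that have already been seen. That keeps the (word, tag) pairs and the original order. A minimal sketch with invented data; `dedupe_sents` is a hypothetical helper name.

```python
def dedupe_sents(sents):
    """Return the sentences in their original order, dropping exact repeats."""
    seen = set()
    unique = []
    for sent in sents:
        key = tuple(sent)  # lists are unhashable; a tuple of tuples is fine
        if key not in seen:
            seen.add(key)
            unique.append(sent)
    return unique

# A sentence with two 'to'/TO tokens gets recorded twice when every match is appended.
twice = [('to', 'TO'), ('be', 'BE'), ('or', 'CC'), ('not', '*'), ('to', 'TO'), ('be', 'BE')]
once = [('I', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('go', 'VB')]
recorded = [twice, twice, once]

print(len(recorded), len(dedupe_sents(recorded)))  # 3 2
```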
Answer 1 (score: 1)
On why this doesn't work:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print(sent)
        if (word != 'to' and tag != 'TO-HL'):
            break
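Before explaining: your code isn't really very close to the output you want. That's because your if statements don't really do what you need.
First, you need to understand what multiple conditions (i.e. several separate 'if's) do.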
# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS
    for word, tag in sent:
        # For each pair, check whether the word is 'to' and the tag is 'TO-HL':
        if word == 'to' and tag == 'TO-HL':
            print(sent)  # If the condition is true, print sent
        # After checking the first if, you continue to check the second if:
        # if word is not 'to' and tag is not 'TO-HL',
        # you want to break out of the sentence. Note that you are still
        # in the same iteration as the previous condition.
        if word != 'to' and tag != 'TO-HL':
            break
Now let's start with some basic if-else statements:
>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...         break
...     else:
...         print('say hi')
...
>>>
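In the example above, we loop through each word + POS pair in the sentence. For each pair, the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If that is the case, it breaks and never says hi to you.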
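So if you keep these if conditions in your code, the loop breaks without continuing whenever 'to' is not the first word in the sentence and the first word's POS does not match.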
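The early break can be reproduced on a single hand-made sentence. A minimal sketch, assuming an invented `toy_sent` and a hypothetical `visited` list that only records how far the loop gets:

```python
# Invented sentence; 'to' is the third token.
toy_sent = [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')]

visited = []
for word, tag in toy_sent:
    visited.append(word)
    if word == 'to' and tag == 'TO':
        print('found it')
    if word != 'to' and tag != 'TO':
        break  # fires on 'She', so the loop never reaches 'to'

print(visited)  # ['She']
```

Nothing is printed by the first `if`, even though the sentence does contain 'to' tagged TO.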
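In fact, your if conditions are trying to check whether every word is 'to' and whether its POS tag is 'TO-HL'.
What you want to do is check (1) whether 'to' is in the sentence and (2) whether the POS of that 'to' is the one you are looking for.
So the if condition needed for condition (1) is: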
>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         print(sent)
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
Now you know that if 'to' in dict(sent) checks whether 'to' is in the sentence.
Then, to check condition (2):
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO':
...             print(sent)
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO-HL':
...             print(sent)
...
>>>
Now you see that if dict(sent)['to'] == 'TO-HL', checked AFTER if 'to' in dict(sent), is the condition that enforces the POS restriction.
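But you'll realize that if there are two 'to's in a sentence, dict(sent)['to'] only captures the POS of the last 'to'. That is why you need the defaultdict(list) suggested in the previous answer.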
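The overwriting behaviour of dict() can be seen on a small example. This is a sketch with an invented sentence in which 'to' occurs twice with different tags; `tags_by_word` is a hypothetical name.

```python
from collections import defaultdict

# Invented sentence: 'to' appears first as IN, then as TO.
toy_sent = [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN'),
            ('to', 'TO'), ('shop', 'VB'), ('.', '.')]

# dict() keeps one value per key, so the later ('to', 'TO') pair
# overwrites the earlier ('to', 'IN') pair.
print(dict(toy_sent)['to'])  # TO

# Collecting the tags in a list per word keeps both of them.
tags_by_word = defaultdict(list)
for word, tag in toy_sent:
    tags_by_word[word].append(tag)

print(tags_by_word['to'])  # ['IN', 'TO']
```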
There is really no clean way to perform the check, and the most efficient way is the one described in the previous answer, sigh.