Question

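I'm new to NLP and NLTK, and I want to find ambiguous words, meaning words that have at least n different tags. I wrote the method below, but its output is more than confusing.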

Code:

def MostAmbiguousWords(words, n):
    # wordsUniqeTags holds a list of uniqe tags that have been observed for a given word
    wordsUniqeTags = {}
    for (w,t) in words:
        if wordsUniqeTags.has_key(w):
            wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)
        else:
            wordsUniqeTags[w] = set([t])
    # Starting to count
    res = []
    for w in wordsUniqeTags:
        if len(wordsUniqeTags[w]) >= n:
            res.append((w, wordsUniqeTags[w]))

    return res

MostAmbiguousWords(brown.tagged_words(), 13)

Output:

[("what's", set(['C', 'B', 'E', 'D', 'H', 'WDT+BEZ', '-', 'N', 'T', 'W', 'V', 'Z', '+'])),
("who's", set(['C', 'B', 'E', 'WPS+BEZ', 'H', '+', '-', 'N', 'P', 'S', 'W', 'V', 'Z'])),
("that's", set(['C', 'B', 'E', 'D', 'H', '+', '-', 'N', 'DT+BEZ', 'P', 'S', 'T', 'W', 'V', 'Z'])),
('that', set(['C', 'D', 'I', 'H', '-', 'L', 'O', 'N', 'Q', 'P', 'S', 'T', 'W', 'CS']))]

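Now I have no idea what B, C, Q, etc. could stand for. So, my questions:

What are these?
What do they mean? (If they are tags.)
I think they are not tags, because who and whats don't have the WH tag that indicates "wh question words".

I'd be happy if someone could post a link to a mapping of all possible tags and their meanings.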

Answer 1

It looks like you have a typo. In this line:

wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)

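you should have set([t]) (not set(t)), just as you do in the else case.

This explains the behavior you're seeing: t is a string, so set(t) builds a set out of every character in the string. What you want is set([t]), which creates a set with t as its only element.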

>>> t = 'WHQ'
>>> set(t)
set(['Q', 'H', 'W'])    # bad
>>> set([t])
set(['WHQ'])            # good

By the way, you can both fix the problem and simplify the code by changing that line to:

wordsUniqeTags[w].add(t)

Really, though, you should improve the overall approach by using the dict method setdefault and list comprehension syntax. So try this:

def most_ambiguous_words(words, n):
  # wordsUniqeTags maps each word to the set of unique tags observed for it
  wordsUniqeTags = {}
  for (w,t) in words:
    wordsUniqeTags.setdefault(w, set()).add(t)
  # Starting to count
  return [(word,tags) for word,tags in wordsUniqeTags.iteritems() if len(tags) >= n]
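
The function above targets Python 2. For readers on Python 3, where dict.iteritems() no longer exists and .items() takes its place, a minimal sketch might look like the following. The names tags_by_word and sample are illustrative only, and sample is a small hand-made list standing in for brown.tagged_words() so the snippet runs without NLTK:

```python
def most_ambiguous_words(words, n):
    """Return (word, tags) pairs for words seen with at least n distinct tags."""
    tags_by_word = {}
    for word, tag in words:
        # setdefault creates the empty set on first sight, then add() stores the whole tag
        tags_by_word.setdefault(word, set()).add(tag)
    return [(word, tags) for word, tags in tags_by_word.items() if len(tags) >= n]

# Hypothetical sample data (not taken from the Brown corpus)
sample = [("that", "CS"), ("that", "DT"), ("that", "WPS"), ("dog", "NN"), ("that", "CS")]
result = most_ambiguous_words(sample, 3)
print(result)
```

Only "that" is returned here, since it is the only word in the sample with three distinct tags.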

Answer 2

You are splitting the POS tags into individual characters in this line:

    wordsUniqeTags[w] = wordsUniqeTags[w] | set(t)

set('AT') results in set(['A', 'T']).
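
This also accounts for why the question's output mixes one whole tag such as 'WDT+BEZ' with single characters: the first sighting of a word goes through the else branch and is stored intact, while every later sighting is split by set(t). A small sketch of that sequence, written for Python 3 and using the 'WDT+BEZ' tag from the question's output:

```python
tags = set(["WDT+BEZ"])        # first sighting: else branch, tag stored whole
tags = tags | set("WDT+BEZ")   # later sighting: buggy branch, tag split into characters
print(sorted(tags))
# ['+', 'B', 'D', 'E', 'T', 'W', 'WDT+BEZ', 'Z']
```

One tag has turned into eight "tags", which is how a word can appear to pass a threshold like n = 13.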

Answer 3

How about using the Counter and defaultdict functionality from the collections module?

from collections import defaultdict, Counter

def most_ambiguous_words(words, n):
    counts = defaultdict(Counter)
    for (word,tag) in words:
        counts[word][tag] += 1
    # Keep words seen with at least n distinct tags
    return [(w, counts[w].keys()) for w in counts if len(counts[w]) >= n]
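
One thing a Counter gives you beyond a plain set is the frequency of each tag per word. A minimal Python 3 sketch of that idea follows; the helper name tag_counts and the sample list are hypothetical, with the sample standing in for brown.tagged_words():

```python
from collections import Counter, defaultdict

def tag_counts(words):
    """Map each word to a Counter of how often each tag was seen for it."""
    counts = defaultdict(Counter)
    for word, tag in words:
        counts[word][tag] += 1
    return counts

# Hypothetical sample data (not taken from the Brown corpus)
sample = [("that", "CS"), ("that", "DT"), ("that", "CS"), ("dog", "NN")]
counts = tag_counts(sample)
print(len(counts["that"]))             # number of distinct tags: 2
print(counts["that"].most_common(1))   # [('CS', 2)]
```

If only the set of distinct tags matters, the set-based versions do the same job with less bookkeeping.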

Is this a list of tags or something else?
