Question

I have the following sample data:

docs_word = ["this is a test", "this is another test"]
docs_txt = ["this is a great test", "this is another test"]

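What I want to do is build two dictionaries of the words in the sample files, compare them, and store the words that appear in docs_txt but not in docs_word in a separate dictionary. So I wrote the following: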

from collections import Counter

count_txtDoc = Counter()
for file in docs_word:
  words = file.split(" ")
  count_txtDoc.update(words)

count_wrdDoc = Counter()
for file in docs_txt:
  words = file.split(" ")
  count_wrdDoc.update(words)

#Create a list of the dictionary keys
words_worddoc = count_wrdDoc.keys()
words_txtdoc = count_txtDoc.keys()

#Look for values that are in word_doc but not in txt_doc

count_all = Counter()
for val in words_worddoc:
  if val not in words_txtdoc:
   count_all.update(val)
   print(val)

Now, the thing is that print(val) shows the correct value. It prints: "great".

However, if I print:

print(count_all)

I get the following output:

Counter({'a': 1, 'r': 1, 'e': 1, 't': 1, 'g': 1})

whereas I was expecting:

Counter({'great': 1})

Any ideas on how to achieve this?

Answer 1

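Update the counter with an iterable that contains the word, not with the word itself. A string is also an iterable, so passing it directly makes update() count each character: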

count_all.update([val])
#                ^   ^
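
To make the difference concrete, here is a minimal sketch comparing the two calls. The names by_chars, by_word and by_index are illustrative and do not come from the question; the last variant, incrementing the key directly, is an alternative not mentioned above that should give the same result.

```python
from collections import Counter

# Counter.update() iterates over its argument and counts each element.
by_chars = Counter()
by_chars.update("great")      # a string iterates character by character

by_word = Counter()
by_word.update(["great"])     # a one-element list yields the whole word

# Alternative: a missing key in a Counter reads as 0, so += 1 works directly.
by_index = Counter()
by_index["great"] += 1

print(by_chars)   # one entry per letter
print(by_word)    # a single entry for the word
print(by_index)   # same as by_word
```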

However, if you only want the items, you may not need to create a new Counter at all. You can take the symmetric difference of the keys:

words_worddoc = count_wrdDoc.viewkeys() # use .keys() in Py3
words_txtdoc = count_txtDoc.viewkeys()  # use .keys() in Py3

print(words_txtdoc ^ words_worddoc)
# set(['great'])
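
For Python 3, the same idea might look like the sketch below. It rebuilds the counters from the sample data under the hypothetical names count_word and count_txt (chosen here for readability, not taken from the question), and also shows the one-sided difference, which is closer to "words in docs_txt but not in docs_word".

```python
from collections import Counter

docs_word = ["this is a test", "this is another test"]
docs_txt = ["this is a great test", "this is another test"]

# Build one Counter per document collection.
count_word = Counter(w for doc in docs_word for w in doc.split())
count_txt = Counter(w for doc in docs_txt for w in doc.split())

# In Python 3, dict.keys() returns a set-like view that supports ^ and -.
either_only = count_txt.keys() ^ count_word.keys()  # in exactly one collection
txt_only = count_txt.keys() - count_word.keys()     # in docs_txt, not in docs_word

print(either_only)
print(txt_only)
```

With this sample data both results contain only 'great', because docs_word has no word that is missing from docs_txt.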

If you also want the counts, you can compute the symmetric difference between the two counters, like so:

count_all = (count_wrdDoc - count_txtDoc) | (count_txtDoc - count_wrdDoc)

print (count_all)
# Counter({'great': 1})
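
One thing to keep in mind: Counter subtraction works on counts, not just keys, and drops any result that is zero or negative. The sketch below uses two made-up counters (a and b are hypothetical, not the question's data) to show that a word present on both sides still appears in the result when its counts differ.

```python
from collections import Counter

a = Counter({"test": 3, "great": 1})
b = Counter({"test": 1, "another": 2})

# Subtraction keeps only positive results, so each direction
# yields what is left over on that side.
only_a = a - b   # "test" survives with 3 - 1 = 2
only_b = b - a   # "test" would be -2 and is dropped

# Union takes the maximum count per key.
sym = only_a | only_b

print(only_a)
print(only_b)
print(sym)
```

If only the presence of a word matters, the key-based set difference is probably the simpler choice.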

Create a dictionary of the words that are in one document but not in the other
