Create a dictionary of words that are in one document but not in another

Time: 2017-03-29 09:47:49

Tags: python

I have the following sample data:

docs_word = ["this is a test", "this is another test"]
docs_txt = ["this is a great test", "this is another test"]

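What I want to do is build two dictionaries of the words in the sample files, compare them, and store the words that are in the docs_txt file but not in the docs_word file in a separate dictionary. So I wrote the following: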

from collections import Counter

count_txtDoc = Counter()
for file in docs_word:
  words = file.split(" ")
  count_txtDoc.update(words)

count_wrdDoc = Counter()
for file in docs_txt:
  words = file.split(" ")
  count_wrdDoc.update(words)

#Create a list of the dictionary keys
words_worddoc = count_wrdDoc.keys()
words_txtdoc = count_txtDoc.keys()

#Look for values that are in word_doc but not in txt_doc

count_all = Counter()
for val in words_worddoc:
  if val not in words_txtdoc:
   count_all.update(val)
   print(val)

The thing is, the print statement inside the loop shows the correct value. It prints: "great".

However, if I print:

print(count_all)

I get the following output:

Counter({'a': 1, 'r': 1, 'e': 1, 't': 1, 'g': 1})

whereas I was expecting

Counter({'great': 1})

Any ideas on how to achieve this?

1 answer:

Answer 0 (score: 1)

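Update the Counter with an iterable that contains the word, not with the word itself (a string is also an iterable, so its characters get counted one by one):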

count_all.update([val])
#                ^   ^ 
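
To make the difference concrete, here is a minimal sketch of how Counter.update treats a bare string versus a one-element list. The names by_letters, by_word, and by_index are illustrative and do not come from the question.

```python
from collections import Counter

val = "great"

# A string is an iterable of characters, so update() counts each letter.
by_letters = Counter()
by_letters.update(val)

# Wrapping the word in a list makes update() count the word as one item.
by_word = Counter()
by_word.update([val])

# Incrementing the key directly is an equivalent alternative.
by_index = Counter()
by_index[val] += 1

print(by_letters)
print(by_word)
print(by_index)
```

The first Counter holds one entry per letter, which matches the output seen in the question; the other two hold a single entry for the whole word.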

However, if you only need the items, you may not need to create a new Counter at all. You can take the symmetric difference of the keys:

words_worddoc = count_wrdDoc.viewkeys() # use .keys() in Py3
words_txtdoc = count_txtDoc.viewkeys()  # use .keys() in Py3

print(words_txtdoc ^ words_worddoc)
# set(['great'])
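
On Python 3, .keys() already returns a set-like view, so the same idea might look like the sketch below. Note that ^ yields words found in exactly one of the two collections; if only the words present in docs_txt and missing from docs_word are wanted, as the question describes, plain set difference (-) is the narrower operation. The names count_word, count_txt, only_in_txt, and in_exactly_one are illustrative, not from the original post.

```python
from collections import Counter

docs_word = ["this is a test", "this is another test"]
docs_txt = ["this is a great test", "this is another test"]

# Count the words of each collection (str.split() splits on whitespace).
count_word = Counter(w for doc in docs_word for w in doc.split())
count_txt = Counter(w for doc in docs_txt for w in doc.split())

# Words that occur in docs_txt but never in docs_word.
only_in_txt = count_txt.keys() - count_word.keys()

# Words that occur in exactly one of the two collections.
in_exactly_one = count_txt.keys() ^ count_word.keys()

print(only_in_txt)
print(in_exactly_one)
```

With this sample data both results contain only "great", because docs_word has no word that docs_txt lacks.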

If you also want the counts, you can compute the symmetric difference between the two Counters like this:

count_all = (count_wrdDoc - count_txtDoc) | (count_txtDoc - count_wrdDoc)

print (count_all)
# Counter({'great': 1})
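
One caveat worth keeping in mind: Counter subtraction works on counts rather than on key membership, so a word that appears in both Counters can still survive with its surplus count. The sketch below uses made-up sentences and the illustrative names a, b, surplus, and absent to show the difference from a membership filter.

```python
from collections import Counter

a = Counter("this is a great great test".split())
b = Counter("this is a great test".split())

# Subtraction keeps only positive counts, so 'great' remains with the surplus.
surplus = a - b
print(surplus)  # Counter({'great': 1})

# To keep only words that are absent from b, filter on membership instead.
absent = Counter({w: n for w, n in a.items() if w not in b})
print(absent)  # Counter()
```

For the data in the question the two approaches agree, since "great" does not occur in docs_word at all.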