How do I remove all instances of the items in set A from set B?

Asked: 2011-05-10 14:50:40

Tags: python, set

As shown below, when I open test.txt and put its words into a set, the difference between that set and the common_words set is returned. However, this only removes a single instance of the words in common_words rather than all instances of them. How can I do this? I want to remove ALL instances of the items in common_words from title_words.
from string import punctuation
from operator import itemgetter

N = 10
words = {}

linestring = open('test.txt', 'r').read()

# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

title = linestring

# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())

keywords = title_words.difference(common_words)

words_gen = (word.strip(punctuation).lower() for line in keywords
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

7 Answers:

Answer 0 (score: 1)

If title_words is a set, then any given word occurs in it only once, so there is only one occurrence to remove. Am I misunderstanding your question?
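
A quick interactive sketch of the point: a set stores each word exactly once, so difference() removes the whole word in one go (set display order may vary):

>>> title_words = set("the fox the fox".split())
>>> title_words
set(['the', 'fox'])
>>> title_words.difference(["the"])
set(['fox'])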


I'm still puzzled by the question, but I notice that one problem may be that when you run the initial data through set(), the punctuation has not yet been stripped, so multiple punctuated versions of the same word can survive the .difference() operation. Try this:

title_words = set(word.strip(punctuation) for word in title.lower().split())

Also, the way your words_gen generator is written is a bit confusing. Why for line in keywords? What lines? And why call split() again? keywords should be a set of individual words, right?
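
If keywords really is a set of individual words, the generator can presumably be collapsed to a single loop (a sketch, assuming keywords was built as above):

words_gen = (word.strip(punctuation).lower() for word in keywords)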

Answer 1 (score: 1)

I agree with sentle. Try this code:

for common_word in common_words:
    try:
        title_words.remove(common_word)
    except KeyError:
        print "The common word %s was not in title_words" % common_word

That should do it.

Hope this helps.
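
As an aside, a minimal sketch that avoids the loop entirely, assuming title_words is the set from the question: set.difference_update discards every member of common_words in one call and does not complain about words that are absent.

title_words.difference_update(common_words)
# or, equivalently, the in-place operator form (both operands are sets):
title_words -= common_words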

Answer 2 (score: 1)

Strip the punctuation before you make the set. You might be tempted to write:

keywords = title_words.strip(punctuation).difference(common_words)

but that tries to call the strip method on title_words, and title_words is a set; only str has that method, not set. You could do this instead:

for ch in punctuation:
    title = title.replace(ch, '')

title_words = set(title.lower().split())

keywords = title_words.difference(common_words)
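
The same cleanup can also be done in one call using the two-argument form of str.translate in Python 2, where the second argument is a string of characters to delete (a sketch under that assumption):

title = title.translate(None, punctuation)
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)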

Answer 3 (score: 1)

You just need the difference() method, but it looks like there is a problem with your example.

title_words is a set, and sets have no strip() method.

Try this instead:

title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
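
For what it's worth, the - operator on sets is shorthand for the same call:

keywords = title_words - common_words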

Answer 4 (score: 1)

What you have actually managed to find are the top N most distinctly punctuated words in your input file.

Run this input file through your original code:

the quick brown fox.
The quick brown fox?
The quick brown fox! 
the quick, brown fox

and you will get the following output:

fox: 4
quick: 2
brown: 1

Note that fox shows up in 4 variants: fox, fox?, fox! and fox. (the last with a trailing period). The word brown appears only one way, and quick shows up both with and without a trailing comma (2 variants).
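
You can see the variants directly in an interpreter session (a sketch, assuming test.txt holds the four lines above):

>>> sorted(set(open('test.txt').read().lower().split()))
['brown', 'fox', 'fox!', 'fox.', 'fox?', 'quick', 'quick,', 'the']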

What happens when we add fox to the common_words set? Only the variant without trailing punctuation is removed; the three punctuated variants survive, giving this output:

fox: 3
quick: 2
brown: 1

For a more realistic example, run MLK's I Have a Dream speech through your method:

justice: 4
children: 3
today: 3
rights: 3
satisfied: 3
nation: 3
day: 3
ring: 3
hope: 3
injustice: 3

Dr. King says "I have a dream" eight times in the speech, yet dream doesn't appear in the list at all. Search for justice and you will find four (4) punctuated occurrences:

  • until "justice rolls
  • palace of justice: In
  • make justice a reality
  • path of racial justice.

So what went wrong? It looks like this routine has been through a lot of rework, since the variable names no longer seem to match their purpose. Let's step through it (apologies, I've rearranged the code slightly):

We open the file and read the whole thing into linestring. So far so good, aside from the variable name:

linestring = open(filename, 'r').read()

Is this a line or a title? Both? In any case, we now lowercase the entire file and split it on whitespace. With my test file, that means title_words now contains fox?, fox!, fox and fox. (with the trailing period):

title = linestring
title_words = set(title.lower().split())

Now we try to remove the common words. Suppose our common_words contains fox. The next line removes fox, but leaves fox?, fox! and fox. behind:

keywords = title_words.difference(common_words)

The next line really looks vestigial to me, as if it were meant to read for line in linestring.split('\n') for word in line.split(). In its current form, keywords is just a collection of individual words, so line is a single word with no spaces in it, and for word in line.split() has no effect. We simply iterate over every word, strip its punctuation, and lowercase it. words_gen now contains 3 copies of fox: fox, fox, fox (the one un-punctuated copy was removed in the previous step). A corrected sketch follows the code below.

words_gen = (word.strip(punctuation).lower() for line in keywords
                                             for word in line.split())
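
Presumably the intent was something along those lines; a corrected sketch (my guess, not the poster's code) that walks the raw lines of the file instead of the already-split keywords:

words_gen = (word.strip(punctuation).lower() for line in linestring.split('\n')
                                             for word in line.split())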

The frequency analysis itself makes perfect sense: it builds a histogram from the words coming out of the words_gen generator. Which ultimately gives us the N most distinctly punctuated words! In this example, fox=3:

words = {}
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

So that's what went wrong. Others have already posted clean word-frequency solutions, but I have a bit of a performance mindset and came up with my own variant. First, split the text into words using a regular expression:

import re

# 1. assumes proper word spacing after punctuation; if not, then something like
#    "I ate.I slept" will return "I", "ATEI", "SLEPT"
# 2. handles contractions properly, e.g. "don't" becomes "DONT"
# 3. removes any unexpected characters such as Unicode non-breaking spaces and
#    non-printable ASCII characters (MS Word inserts ASCII 0x05 for
#    in-line review comments)
clean = re.sub("[^\w\s]+", "", text.upper())
words = clean.split()
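
On the four-line fox file from earlier, this produces uppercase, punctuation-free tokens (a quick interpreter check):

>>> import re
>>> text = open('test.txt').read()
>>> re.sub("[^\w\s]+", "", text.upper()).split()[:4]
['THE', 'QUICK', 'BROWN', 'FOX']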

Now, based on Python Performance Tips for Initializing Dictionary Entries (and my own performance measurements), find the top N most common words:

from collections import defaultdict
from operator import itemgetter

# first create a dictionary that will count the number of words.
# using defaultdict(int) is the 2nd fastest method I measured, but
# the most readable. It was very close in speed to the "if not w in freq" technique
freq = defaultdict(int)
for w in words:
    freq[w] += 1

# remove any of the common words by deleting common keys from the dictionary
for k in common_words:
    if k in freq:
        del freq[k]

# Ryan's original top-N selection was the fastest of several
# methods I tried including using dictview and lambda functions
# - sort the items by directly accessing item[1] (i.e., the value/frequency count)
top = sorted( freq.iteritems(), key=itemgetter(1), reverse=True)[:N]
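
Printing the result then mirrors the output loop from the original code (Python 2 print syntax):

for word, count in top:
    print "%s: %d" % (word, count)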

Turned loose on Dr. King's speech, removing all the articles and pronouns:

('OF', 99)
('TO', 59)
('AND', 53)
('BE', 33)
('WE', 30)
('WILL', 27)
('THAT', 24)
('IS', 23)
('IN', 22)
('THIS', 20)

And, just for kicks, my performance measurements:

Original                      ; 0:00:00.645000 ************
SortAllWords                  ; 0:00:00.571000 ***********
MyFind                        ; 0:00:00.870000 *****************
MyImprovedFind                ; 0:00:00.551000 ***********
DontInsertCommon              ; 0:00:00.649000 ************
JohnGainsJr                   ; 0:00:00.857000 *****************
ReturnImmediate               ; 0:00:00
SortWordsAndReverse           ; 0:00:00.572000 ***********
JustCreateDic_GetZero         ; 0:00:00.439000 ********
JustCreateDic_TryExcept       ; 0:00:00.732000 **************
JustCreateDic_IfNotIn         ; 0:00:00.309000 ******
JustCreateDic_defaultdict     ; 0:00:00.328000 ******
CreateDicAndRemoveCommon      ; 0:00:00.437000 ********

Cheers, ë

Answer 5 (score: 0)

Not ideal, but it does work as a word-frequency counter (which seems to be what this is aiming at):

from string import punctuation
from operator import itemgetter
import itertools

N = 10

linestring = open('test.txt', 'r').read()

common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))

words = [w.strip(punctuation) for w in linestring.lower().split()]

keywords = itertools.ifilterfalse(lambda w: w in common_words, words)

words = {}
for word in keywords:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

Answer 6 (score: 0)

I recently wrote some code that does something similar, although the style is quite different from yours. Maybe it will help you.

import string
import sys

def main():
    # get some stop words
    stopf = open('stop_words.txt', "r")
    stopwords = {}
    for s in stopf:
        stopwords[string.strip(s)] = 1

    # read the input file named on the command line and split it into words
    infile = open(sys.argv[1], "r")
    filedata = infile.read()
    words = string.split(filedata)

    # build a histogram of word counts, skipping stop words
    histogram = {}
    count = 0
    for word in words:
        word = string.strip(word, string.punctuation)
        word = string.lower(word)
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        # print a progress marker every 1000 words
        count = (count + 1) % 1000
        if count == 0:
            print '*',

    # sort by count, descending, and print the top 100
    flist = []
    for word, count in histogram.items():
        flist.append([count, word])
    flist.sort()
    flist.reverse()
    for pair in flist[0:100]:
        print "%30s: %4d" % (pair[1], pair[0])

main()
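
Assuming a stop_words.txt file with one stop word per line sits next to the script, it can be run as something like this (wordfreq.py is a hypothetical name for the file above):

python wordfreq.py test.txt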