As shown below, when I open test.txt and put its words into a set, I return the difference between that set and the common_words set. However, it only removes a single instance of each word in common_words rather than all instances of them. How can I do that? I want to remove ALL instances of the items in common_words from title_words.

from string import punctuation
from operator import itemgetter
N = 10
words = {}
linestring = open('test.txt', 'r').read()
# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
title = linestring
# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
print "%s: %d" % (word, frequency)
Answer 0 (score: 1)
If title_words is a set, then any given word appears in it only once, so there is only one occurrence to remove. Am I misunderstanding your question?

I'm still puzzled by this question, but I notice one problem may be that punctuation has not yet been stripped when you build the initial set, so several punctuated versions of the same word may survive the .difference() operation. Try this:
title_words = set(word.strip(punctuation) for word in title.lower().split())
Also, your words_gen generator is written in a slightly confusing way. Why for line in keywords? What lines? And why call split() again? keywords should be a set of plain words, right?
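To illustrate the point, a minimal sketch (the sample string is made up):

# hypothetical sample: 'fox.' and 'fox' are distinct set members
title = "the quick brown fox. the quick brown fox"
title_words = set(title.lower().split())
print title_words.difference(set(["fox"]))  # 'fox.' survives the difference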
Answer 1 (score: 1)
I agree with sentle. Try this code:
for common_word in common_words:
    try:
        title_words.remove(common_word)
    except KeyError:
        print "The common word %s was not in title_words" % common_word
That should do it. Hope this helps.
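As an aside, set.discard() removes an element without raising an error when it is absent, so the same cleanup can be written without the try/except (a sketch, minus the reporting):

for common_word in common_words:
    title_words.discard(common_word)  # silently ignores words that are absent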
Answer 2 (score: 1)
Strip the punctuation before you build the set. You cannot do

keywords = title_words.strip(punctuation).difference(common_words)

because that tries to call the strip method of str on title_words, which is a set (only str has that method). You could do this instead:
for chr in punctuation:
    title = title.replace(chr, '')
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
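In Python 2, str.translate can do the same punctuation removal in one call; its second argument is a string of characters to delete (a sketch that assumes title is a plain str, not unicode):

title = title.translate(None, punctuation)  # delete every punctuation character
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)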
Answer 3 (score: 1)
You only need the difference() method, but it looks like your example has a bug: title_words is a set, and sets have no strip() method. Try this instead:
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
Answer 4 (score: 1)
You have successfully found the top N most uniquely-punctuated words in your input file.
Run this input file through your original code:
the quick brown fox.
The quick brown fox?
The quick brown fox!
the quick, brown fox
and you will get this output:
fox: 4
quick: 2
brown: 1
Note that fox appears in four versions: fox, fox?, fox! and fox. (with the period). The word brown appears only one way, and quick shows up with and without a comma (two variations).
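You can see the variants directly (a quick interactive sketch, with test.txt holding the four lines above):

>>> title_words = set(open('test.txt').read().lower().split())
>>> sorted(w for w in title_words if w.startswith('fox'))
['fox', 'fox!', 'fox.', 'fox?']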
What happens when we add fox to the common_words set? Only the variant with no trailing punctuation is removed, leaving the three punctuated variants behind and giving this output:
fox: 3
quick: 2
brown: 1
For a more realistic example, run MLK's I Have a Dream speech through your method:
justice: 4
children: 3
today: 3
rights: 3
satisfied: 3
nation: 3
day: 3
ring: 3
hope: 3
injustice: 3
Dr. King says "I have a dream" eight times in that speech, yet dream does not appear in the list at all. Search for justice and you will find four (4!) punctuated versions of it.
So what went wrong? It looks like this code has been through a lot of rework, given that the variable names no longer seem to match their purpose. Let's walk through it (with apologies for shuffling the code around slightly):
Open the file and slurp the whole thing into linestring. Fine so far, apart from the variable name:
linestring = open(filename, 'r').read()
Is this a line or a title? Both? In any case, we now lowercase the whole file and split it on whitespace. With my test file, title_words now contains fox?, fox!, fox, and fox. (that last one with its trailing period):
title = linestring
title_words = set(title.lower().split())
Now try to remove the common words. Suppose our common_words contains fox. The next line removes fox, but leaves fox?, fox! and fox. behind:
keywords = title_words.difference(common_words)
The next line looks vestigial to me, as if it were once meant to be for line in linestring.split('\n') for word in line.split(). In its current form, keywords is just a collection of single words, so line is one word with no whitespace in it and for word in line.split() does nothing. We simply iterate over every word, strip its punctuation, and lowercase it. words_gen now holds three copies of fox (fox, fox, fox); the one un-punctuated version is the one we removed.
words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())
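If the intent really was just to normalize the remaining keywords, the generator collapses to a single loop (a sketch of that reading):

words_gen = (word.strip(punctuation).lower() for word in keywords)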
The frequency analysis itself makes sense: it builds a histogram of the words coming out of the words_gen generator, which ultimately gives us the N most uniquely-punctuated words! In this example, fox=3:
words = {}
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
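Incidentally, on Python 2.7+ the histogram plus sort can be condensed with collections.Counter, which implements exactly this pattern:

from collections import Counter
top_words = Counter(words_gen).most_common(N)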
So that is what went wrong. Others have already posted clean word-frequency solutions, but I am in a bit of a performance mindset, so I came up with my own variant. First, split the text into words using a regular expression:
import re

# text holds the raw contents of the input file
# 1. assumes proper word spacing after punctuation, if not, then something like
#    "I ate.I slept" will return "I", "ATEI", "SLEPT"
# 2. handles contractions properly. E.g., "don't" becomes "DONT"
# 3. removes any unexpected characters such as Unicode Non-breaking space and
#    non-printable ascii characters (MS Word inserts ASCII 0x05 for
#    in-line review comments)
clean = re.sub("[^\w\s]+", "", text.upper())
words = clean.split()
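A quick check of the first caveat (interactive sketch):

>>> import re
>>> re.sub("[^\w\s]+", "", "I ate.I slept".upper()).split()
['I', 'ATEI', 'SLEPT']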
Now, following Python Performance Tips for Initializing Dictionary Entries (plus my own performance measurements), find the top N most frequent words:
from collections import defaultdict
from operator import itemgetter

# first create a dictionary that will count the number of words.
# using defaultdict(int) is the 2nd fastest method I measured but
# the most readable. It was very close in speed to "if not w in freq" technique
freq = defaultdict(int)
for w in words:
    freq[w] += 1
# remove any of the common words by deleting common keys from the dictionary
for k in common_words:
    if k in freq:
        del freq[k]
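# caveat: the regex step above upper-cased every word, so common_words would
# need upper-case entries here for these deletions to match; that would explain
# why ('AND', 53) still appears in the output below even though "and" is in the
# question's common_words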
# Ryan's original top-N selection was the fastest of several
# methods I tried including using dictview and lambda functions
# - sort the items by directly accessing item[1] (i.e., the value/frequency count)
top = sorted(freq.iteritems(), key=itemgetter(1), reverse=True)[:N]
Closing out with Dr. King's speech, with all the articles and pronouns stripped:
('OF', 99)
('TO', 59)
('AND', 53)
('BE', 33)
('WE', 30)
('WILL', 27)
('THAT', 24)
('IS', 23)
('IN', 22)
('THIS', 20)
And, just for kicks, my performance measurements:
Original ; 0:00:00.645000 ************
SortAllWords ; 0:00:00.571000 ***********
MyFind ; 0:00:00.870000 *****************
MyImprovedFind ; 0:00:00.551000 ***********
DontInsertCommon ; 0:00:00.649000 ************
JohnGainsJr ; 0:00:00.857000 *****************
ReturnImmediate ; 0:00:00
SortWordsAndReverse ; 0:00:00.572000 ***********
JustCreateDic_GetZero ; 0:00:00.439000 ********
JustCreateDic_TryExcept ; 0:00:00.732000 **************
JustCreateDic_IfNotIn ; 0:00:00.309000 ******
JustCreateDic_defaultdict ; 0:00:00.328000 ******
CreateDicAndRemoveCommon ; 0:00:00.437000 ********
Cheers, ë
Answer 5 (score: 0)
Not ideal, but it works as a word-frequency counter (which is what this seems to be aiming at):
from string import punctuation
from operator import itemgetter
import itertools
N = 10
words = {}
linestring = open('test.txt', 'r').read()
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
words = [w.strip(punctuation) for w in linestring.lower().split()]
keywords = itertools.ifilterfalse(lambda w: w in common_words, words)
words = {}
for word in keywords:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
Answer 6 (score: 0)
I recently wrote some code that does something similar, although in a style quite different from yours. Maybe it will help you.
import string
import sys

def main():
    # get some stop words
    stopf = open('stop_words.txt', "r")
    stopwords = {}
    for s in stopf:
        stopwords[string.strip(s)] = 1

    file = open(sys.argv[1], "r")
    filedata = file.read()
    words = string.split(filedata)

    histogram = {}
    count = 0
    for word in words:
        word = string.strip(word, string.punctuation)
        word = string.lower(word)
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        count = (count + 1) % 1000
        if count == 0:
            print '*',

    flist = []
    for word, count in histogram.items():
        flist.append([count, word])
    flist.sort()
    flist.reverse()
    for pair in flist[0:100]:
        print "%30s: %4d" % (pair[1], pair[0])

main()
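Assuming the script is saved as, say, wordfreq.py (my name for it) and a stop_words.txt file with one stop word per line sits in the working directory, you would run it as:

python wordfreq.py test.txt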