As shown below, when I open test.txt and put its words into a set, I return the difference between that set and the common_words set. However, it only removes a single instance of each word in common_words rather than all instances of them. How can I do that? I want to remove ALL instances of the items in common_words from title_words.

from string import punctuation
from operator import itemgetter
N = 10
words = {}
linestring = open('test.txt', 'r').read()
# set A, want to remove these from set B
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
title = linestring
# set B, want to remove ALL words in set A from this set and store in keywords
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
print "%s: %d" % (word, frequency)
Answer 0 (score: 1)
If title_words is a set, then any given word appears in it only once, so there is only one occurrence to remove. Am I misunderstanding your question?

I'm still puzzled by this question, but I notice one problem may be that punctuation has not yet been stripped when you build the initial set, so several punctuated versions of the same word may survive the .difference() operation. Try this:
title_words = set(word.strip(punctuation) for word in title.lower().split())
Also, your words_gen generator is written in a slightly confusing way. Why for line in keywords? What lines? And why call split() again? keywords should be a set of plain words, right?
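To illustrate the point, a minimal sketch (the sample string is made up):

# hypothetical sample: 'fox.' and 'fox' are distinct set members
title = "the quick brown fox. the quick brown fox"
title_words = set(title.lower().split())
print title_words.difference(set(["fox"]))  # 'fox.' survives the difference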
Answer 1 (score: 1)
I agree with sentle. Try this code:
for common_word in common_words:
    try:
        title_words.remove(common_word)
    except KeyError:
        print "The common word %s was not in title_words" % common_word
That should do it. Hope this helps.
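As an aside, set.discard() removes an element without raising an error when it is absent, so the same cleanup can be written without the try/except (a sketch, minus the reporting):

for common_word in common_words:
    title_words.discard(common_word)  # silently ignores words that are absent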
Answer 2 (score: 1)
Strip the punctuation before you build the set. You cannot do

keywords = title_words.strip(punctuation).difference(common_words)

because that tries to call the strip method of str on title_words, which is a set (only str has that method). You could do this instead:
for chr in punctuation:
    title = title.replace(chr, '')
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
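In Python 2, str.translate can do the same punctuation removal in one call; its second argument is a string of characters to delete (a sketch that assumes title is a plain str, not unicode):

title = title.translate(None, punctuation)  # delete every punctuation character
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)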
Answer 3 (score: 1)
You only need the difference() method, but it looks like your example has a bug: title_words is a set, and sets have no strip() method. Try this instead:
title_words = set(title.lower().split())
keywords = title_words.difference(common_words)
Answer 4 (score: 1)
You have successfully found the top N most uniquely-punctuated words in your input file.
Run this input file through your original code:
the quick brown fox.
The quick brown fox?
The quick brown fox!
the quick, brown fox
and you will get this output:
fox: 4
quick: 2
brown: 1
Note that fox appears in four versions: fox, fox?, fox! and fox. (with the period). The word brown appears only one way, and quick shows up with and without a comma (two variations).
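You can see the variants directly (a quick interactive sketch, with test.txt holding the four lines above):

>>> title_words = set(open('test.txt').read().lower().split())
>>> sorted(w for w in title_words if w.startswith('fox'))
['fox', 'fox!', 'fox.', 'fox?']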
What happens when we add fox to the common_words set? Only the variant with no trailing punctuation is removed, leaving the three punctuated variants behind and giving this output:
fox: 3
quick: 2
brown: 1
For a more realistic example, run MLK's I Have a Dream speech through your method:
justice: 4
children: 3
today: 3
rights: 3
satisfied: 3
nation: 3
day: 3
ring: 3
hope: 3
injustice: 3
Dr. King says "I have a dream" eight times in that speech, yet dream does not appear in the list at all. Search for justice and you will find four (4!) punctuated versions of it.
So what went wrong? It looks like this code has been through a lot of rework, given that the variable names no longer seem to match their purpose. Let's walk through it (with apologies for shuffling the code around slightly):
Open the file and slurp the whole thing into linestring. Fine so far, apart from the variable name:
linestring = open(filename, 'r').read()
Is this a line or a title? Both? In any case, we now lowercase the whole file and split it on whitespace. With my test file, title_words now contains fox?, fox!, fox, and fox. (that last one with its trailing period):
title = linestring
title_words = set(title.lower().split())
Now try to remove the common words. Suppose our common_words contains fox. The next line removes fox, but leaves fox?, fox! and fox. behind:
keywords = title_words.difference(common_words)
The next line looks vestigial to me, as if it were once meant to be for line in linestring.split('\n') for word in line.split(). In its current form, keywords is just a collection of single words, so line is one word with no whitespace in it and for word in line.split() does nothing. We simply iterate over every word, strip its punctuation, and lowercase it. words_gen now holds three copies of fox (fox, fox, fox); the one un-punctuated version is the one we removed.
words_gen = (word.strip(punctuation).lower() for line in keywords
             for word in line.split())
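If the intent really was just to normalize the remaining keywords, the generator collapses to a single loop (a sketch of that reading):

words_gen = (word.strip(punctuation).lower() for word in keywords)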
The frequency analysis itself makes sense: it builds a histogram of the words coming out of the words_gen generator, which ultimately gives us the N most uniquely-punctuated words! In this example, fox=3:
words = {}
for word in words_gen:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
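Incidentally, on Python 2.7+ the histogram plus sort can be condensed with collections.Counter, which implements exactly this pattern:

from collections import Counter
top_words = Counter(words_gen).most_common(N)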
So that is what went wrong. Others have already posted clean word-frequency solutions, but I am in a bit of a performance mindset, so I came up with my own variant. First, split the text into words using a regular expression:
import re

# text holds the raw contents of the input file
# 1. assumes proper word spacing after punctuation, if not, then something like
#    "I ate.I slept" will return "I", "ATEI", "SLEPT"
# 2. handles contractions properly. E.g., "don't" becomes "DONT"
# 3. removes any unexpected characters such as Unicode Non-breaking space and
#    non-printable ascii characters (MS Word inserts ASCII 0x05 for
#    in-line review comments)
clean = re.sub("[^\w\s]+", "", text.upper())
words = clean.split()
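A quick check of the first caveat (interactive sketch):

>>> import re
>>> re.sub("[^\w\s]+", "", "I ate.I slept".upper()).split()
['I', 'ATEI', 'SLEPT']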
Now, following Python Performance Tips for Initializing Dictionary Entries (plus my own performance measurements), find the top N most frequent words:
from collections import defaultdict
from operator import itemgetter

# first create a dictionary that will count the number of words.
# using defaultdict(int) is the 2nd fastest method I measured but
# the most readable. It was very close in speed to "if not w in freq" technique
freq = defaultdict(int)
for w in words:
    freq[w] += 1
# remove any of the common words by deleting common keys from the dictionary
for k in common_words:
    if k in freq:
        del freq[k]
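# caveat: the regex step above upper-cased every word, so common_words would
# need upper-case entries here for these deletions to match; that would explain
# why ('AND', 53) still appears in the output below even though "and" is in the
# question's common_words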
# Ryan's original top-N selection was the fastest of several
# methods I tried including using dictview and lambda functions
# - sort the items by directly accessing item[1] (i.e., the value/frequency count)
top = sorted(freq.iteritems(), key=itemgetter(1), reverse=True)[:N]
Closing out with Dr. King's speech, with all the articles and pronouns stripped:
('OF', 99)
('TO', 59)
('AND', 53)
('BE', 33)
('WE', 30)
('WILL', 27)
('THAT', 24)
('IS', 23)
('IN', 22)
('THIS', 20)
And, just for kicks, my performance measurements:
Original ; 0:00:00.645000 ************
SortAllWords ; 0:00:00.571000 ***********
MyFind ; 0:00:00.870000 *****************
MyImprovedFind ; 0:00:00.551000 ***********
DontInsertCommon ; 0:00:00.649000 ************
JohnGainsJr ; 0:00:00.857000 *****************
ReturnImmediate ; 0:00:00
SortWordsAndReverse ; 0:00:00.572000 ***********
JustCreateDic_GetZero ; 0:00:00.439000 ********
JustCreateDic_TryExcept ; 0:00:00.732000 **************
JustCreateDic_IfNotIn ; 0:00:00.309000 ******
JustCreateDic_defaultdict ; 0:00:00.328000 ******
CreateDicAndRemoveCommon ; 0:00:00.437000 ********
Cheers, ë
Answer 5 (score: 0)
Not ideal, but it works as a word-frequency counter (which is what this seems to be aiming at):
from string import punctuation
from operator import itemgetter
import itertools
N = 10
words = {}
linestring = open('test.txt', 'r').read()
common_words = set(("if", "but", "and", "the", "when", "use", "to", "for"))
words = [w.strip(punctuation) for w in linestring.lower().split()]
keywords = itertools.ifilterfalse(lambda w: w in common_words, words)
words = {}
for word in keywords:
    words[word] = words.get(word, 0) + 1
top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
Answer 6 (score: 0)
I recently wrote some code that does something similar, although in a style quite different from yours. Maybe it will help you.
import string
import sys

def main():
    # get some stop words
    stopf = open('stop_words.txt', "r")
    stopwords = {}
    for s in stopf:
        stopwords[string.strip(s)] = 1

    file = open(sys.argv[1], "r")
    filedata = file.read()
    words = string.split(filedata)

    histogram = {}
    count = 0
    for word in words:
        word = string.strip(word, string.punctuation)
        word = string.lower(word)
        if word in stopwords:
            continue
        histogram[word] = histogram.get(word, 0) + 1
        count = (count + 1) % 1000
        if count == 0:
            print '*',

    flist = []
    for word, count in histogram.items():
        flist.append([count, word])
    flist.sort()
    flist.reverse()
    for pair in flist[0:100]:
        print "%30s: %4d" % (pair[1], pair[0])

main()
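Assuming the script is saved as, say, wordfreq.py (my name for it) and a stop_words.txt file with one stop word per line sits in the working directory, you would run it as:

python wordfreq.py test.txt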