Counting word frequency and making a dictionary from it

Time: 2014-02-18 11:15:46

Tags: python dictionary count readlines

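I want to take every word from a text file and count the word frequency in a dictionary.

Example: 'this is the textfile, and it is used to take words and count'

d = {'this': 1, 'is': 2, 'the': 1, ...} 

I have not got very far, and I can't see how to complete it. My code so far: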

import sys

argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
    #there should be a counter and somehow it must fill the dict.

12 Answers:

Answer 0 (score: 8)

If you don't want to use collections.Counter, you can write your own function:

import sys

filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()

unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for raw_word in words:
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0 
    wordfreq[word] += 1

For finer control over what counts as a word, look into regular expressions.
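As a minimal sketch of the regular-expression route mentioned above (the helper name count_words and the pattern are illustrative assumptions, not part of the original answer), re.findall can pull out runs of word characters so that punctuation never reaches the dictionary:

```python
import re

def count_words(text):
    """Count words in text, ignoring punctuation (hypothetical helper)."""
    wordfreq = {}
    # \w+ matches runs of letters, digits and underscores
    for word in re.findall(r"\w+", text):
        wordfreq[word] = wordfreq.get(word, 0) + 1
    return wordfreq

sample = "this is the textfile, and it is used to take words and count"
result = count_words(sample)
print(result["is"], result["textfile"])  # prints: 2 1
```

Unlike strip(unwanted_chars), this needs no list of characters to remove, but it also splits on apostrophes and hyphens, which may or may not be what you want.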

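Answer 1 (score: 6)

Although using Counter from the collections library, as @Michael suggested, is the better approach, I am adding this answer just to improve your own code (I believe that is the answer a new Python learner is looking for):

From the comment in your code, it seems you want to improve the code. I assume you are able to read the file content into words (although I usually avoid the read() function and use code of the form for line in file_descriptor:).

Since words is a string, in the for loop for i in words: the loop variable i is not a word but a char. You are iterating over the characters of the string, not over the words in the string words. To understand this, look at the following code snippet: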

>>> for i in "Hi, h r u?":
...  print(i)
... 
H
i
,

h

r

u
?
>>> 

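Iterating over the string char by char rather than word by word is not what you want. To iterate word by word, you should use the split method of Python's string class.
The str.split(sep=None, maxsplit=-1) method returns a list of the words in the string, using sep as the separator (splitting on runs of whitespace if it is not specified), optionally limiting the number of splits to maxsplit.

Notice the code examples below:

Split: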
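To make the separator and split-limit behaviour concrete, here is a small sketch (the sample string is made up for illustration) comparing the default whitespace split, an explicit separator, and a limited split:

```python
text = "Hi,  how are\tyou?"

# No separator: split on any run of whitespace, no empty strings
default_split = text.split()

# Explicit separator: consecutive separators produce empty strings
space_split = text.split(" ")

# maxsplit limits the number of splits (here: at most one)
limited = text.split(None, 1)

print(default_split)  # ['Hi,', 'how', 'are', 'you?']
print(space_split)    # ['Hi,', '', 'how', 'are\tyou?']
print(limited)        # ['Hi,', 'how are\tyou?']
```

For word counting the argument-free form is usually the one you want, since it also handles tabs, newlines and repeated spaces.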

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']

Loop with split:

>>> for i in "Hi, how are you?".split():
...  print(i)
... 
Hi,
how
are
you?

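This looks like what you need, except for the word Hi,. Because split() splits on whitespace by default, Hi, is kept as a single string, which is (obviously) not what you want when counting the frequency of words in a file.

A good solution would be to use a regular expression, but to keep the answer simple I will start with the replace() method. The method str.replace(old, new[, count]) returns a copy of the string in which occurrences of old are replaced with new, optionally limiting the number of replacements to count.

Now look at the code example below to see what I am suggesting: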

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split

Loop:

>>> for word in "Hi, how are you?".replace(',', ' ').split():
...  print(word)
... 
Hi
how
are
you?
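Note that replace(',', ' ') handles only the comma, so you? still keeps its question mark. As a hedged sketch of one way to generalize this without regular expressions (Python 3 only, and using string.punctuation as an assumed definition of "punctuation"), str.translate can map every punctuation character to a space in one pass:

```python
import string

text = "Hi, how are you? Fine; thanks!"

# Map every ASCII punctuation character to a space
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
words = text.translate(table).split()

print(words)  # ['Hi', 'how', 'are', 'you', 'Fine', 'thanks']
```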

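Now, how to count the frequency:

One way is to use Counter, as @Michael suggested. But to follow your approach, where you want to start from an empty dict, do something like this: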

words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    #                ^^ add 1 to 0 or old value from dict 

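What am I doing? Because wordfreq is initially empty, you cannot compute wordfreq[word] + 1 the first time a word appears (reading the missing key raises a KeyError). So I used the dict method setdefault.

dict.setdefault(key, default=None) is similar to get(), but it also sets dict[key]=default if the key is not already in the dict. So the first time a new word appears, I use setdefault to store it in the dict with the value 0, then add 1 and assign the result back to the same dict.

I have written equivalent code using with open instead of a bare open.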
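The difference between get() and setdefault() described above can be seen in a short sketch (the key "hello" is just an illustrative value):

```python
wordfreq = {}

# get() returns the default but leaves the dict unchanged
missing = wordfreq.get("hello", 0)
after_get = dict(wordfreq)

# setdefault() returns the default and also stores it
stored = wordfreq.setdefault("hello", 0)
after_setdefault = dict(wordfreq)

print(missing, after_get)         # 0 {}
print(stored, after_setdefault)   # 0 {'hello': 0}
```

For plain counting either works: wordfreq[word] = wordfreq.get(word, 0) + 1 gives the same result as the setdefault form.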

import os

# open() does not expand '~', so expand it explicitly
with open(os.path.expanduser('~/Desktop/file')) as f:
    words = f.read()
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print(wordfreq)

It runs like this:

$ cat file  # file is 
this is the textfile, and it is used to take words and count
$ python work.py  # indented manually 
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 
 'it': 1, 'to': 1, 'take': 1, 'words': 1, 
 'the': 1, 'textfile': 1}

Using re.split(pattern, string, maxsplit=0, flags=0)

Just change the for loop to for i in re.split(r"[,\s]+", words): and it should produce the correct output.
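A minimal sketch of the re.split variant of the loop follows. One caveat worth noting: when the text starts or ends with a separator, re.split yields an empty string at that end, so the sketch skips empty items (that guard is an addition, not part of the original answer):

```python
import re

words = "this is the textfile, and it is used to take words and count"
wordfreq = {}
for word in re.split(r"[,\s]+", words):
    if word:  # skip '' produced by a leading or trailing separator
        wordfreq[word] = wordfreq.get(word, 0) + 1

# A trailing separator leaves an empty string at the end of the result
trailing = re.split(r"[,\s]+", "a, b,")

print(wordfreq["textfile"], wordfreq["is"])  # 1 2
print(trailing)                              # ['a', 'b', '']
```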

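Edit: it is better to find all runs of alphanumeric characters, because the text may contain more than one kind of punctuation symbol.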

>>> re.findall(r'[\w]+', words) # manually indent output  
['this', 'is', 'the', 'textfile', 'and', 
  'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']

Use the for loop as: for word in re.findall(r'[\w]+', words):

How I would write the code without using read():

The file is:

$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.

The code is:

$ cat work.py
import re
wordfreq = {}
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1

print(wordfreq)

lower() is used to convert upper-case letters to lower case.

Output:

$ python work.py  # manually strip output  
{'and': 3, 'letters': 1, 'text': 1, 'is': 3, 
 'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1, 
 'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1, 
 'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1, 
 'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2, 
 'lines': 1, 'can': 1, 'the': 1}

Answer 2 (score: 3)

from collections import Counter
t = 'this is the textfile, and it is used to take words and count'

dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

Or, better, remove the punctuation before counting:

dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}
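Chained replace() calls need one call per punctuation character. As a hedged sketch of an alternative (combining Counter with the regular-expression tokenizing shown in another answer, and lower-casing as an added assumption), the words can be extracted first and the most frequent ones read off with most_common:

```python
import re
from collections import Counter

t = "This is the textfile, and it is used to take words and count."

# Tokenize on runs of word characters, ignoring case and punctuation
counts = Counter(re.findall(r"\w+", t.lower()))
top_two = counts.most_common(2)

print(counts["is"], counts["and"], counts["textfile"])  # 2 2 1
print(sorted(word for word, _ in top_two))              # ['and', 'is']
```

Counter is a dict subclass, so dict(counts) gives a plain dictionary if one is required.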

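Answer 3 (score: 2)

The following takes the string, splits it into a list with split(), loops over the list with for, and counts the frequency of each item in the sentence with Python's count function, count(). Each word, i, and its frequency are placed as a tuple in an initially empty list, which is then converted into key-value pairs with dict().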

sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []  
for i in sentence:

    word_count = sentence.count(i)  # Pythons count function, count()
    ls.append((i,word_count))       


dict_ = dict(ls)

print(dict_)

Output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}
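One thing to be aware of with this approach: list.count() rescans the whole list for every item, and repeated words are counted again each time they occur. A small sketch of a variation (not part of the original answer) that counts each distinct word only once by iterating over set(sentence):

```python
sentence = 'this is the textfile, and it is used to take words and count'.split()

# Count each distinct word once instead of once per occurrence
dict_ = {word: sentence.count(word) for word in set(sentence)}

print(dict_['is'], dict_['and'], dict_['textfile,'])  # 2 2 1
```

The result contains the same key-value pairs as the list-of-tuples version; only the amount of repeated work changes.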

Answer 4 (score: 1)

# Open your text file and count word frequency
with open("Counter.txt", 'r') as File_obj:
    w_list = File_obj.read()
print(w_list.split())

di = dict()
for word in w_list.split():
    if word in di:
        di[word] = di[word] + 1
    else:
        di[word] = 1

# Find the most frequently used word and its count
largest = -1
maxusedword = ''
for k, v in di.items():
    print(k, v)
    if v > largest:
        largest = v
        maxusedword = k

print(maxusedword, largest)
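The loop that searches for the most used word can also be written with the built-in max(). A minimal sketch, using a small hand-written dictionary in place of the file contents (the sample counts are made up for illustration):

```python
di = {'this': 1, 'is': 2, 'the': 1, 'and': 2}

# max() over the keys, ranked by their counts; ties keep the first key seen
maxusedword = max(di, key=di.get)
largest = di[maxusedword]

print(maxusedword, largest)  # is 2
```

Like the explicit loop with v > largest, this returns the first word that reaches the highest count.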

Answer 5 (score: 0)

sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words.
# iterate through every word

counter_dict = {}
for word in sentence.lower().split():
    # add the word to counter_dict, initialized with 0
    if word not in counter_dict:
        counter_dict[word] = 0
    # increase its count by 1 (note: += adds, whereas =+ would assign +1)
    counter_dict[word] += 1

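Answer 6 (score: 0)

My approach is to do a few things from the ground up:

  1. Remove punctuation from the text input.
  2. Make a list of words.
  3. Remove empty strings.
  4. Iterate through the list.
  5. Make each new word a key in the dictionary with the value 1.
  6. If a word already exists as a key, increment its value by one.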

text = '''this is the textfile, and it is used to take words and count'''
word = '' #This will hold each word

wordList = [] #This will be collection of words
for ch in text: #traversing through the text character by character
#if character is between a-z or A-Z or 0-9 then it's valid character and add to word string..
    if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'): 
        word += ch
    elif ch == ' ': #if character is equal to single space means it's a separator
        wordList.append(word) # append the word in list
        word = '' #empty the word to collect the next word
wordList.append(word)  #the last word to append in list as loop ended before adding it to list
print(wordList)

wordCountDict = {} #empty dictionary which will hold the word count
for word in wordList: #traverse through the word list
    if wordCountDict.get(word.lower(), 0) == 0: #if word doesn't exist then make an entry into dic with value 1
        wordCountDict[word.lower()] = 1
    else: #if word exists then increment the value by one
        wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
print(wordCountDict)

Another approach:

text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
    text = text.replace(ch, ' ')
wordsArray = text.split(' ')
wordDict = {}
for word in wordsArray:
    if len(word) == 0:
        continue
    else:
        wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict)

Answer 7 (score: 0)

You can also use a default dictionary with int type.

from collections import defaultdict
wordDict = defaultdict(int)
text = 'this is the textfile, and it is used to take words and count'.split(" ")
for word in text:
    wordDict[word] += 1

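Explanation: We initialize a default dictionary whose values are of type int. This way the default value for any key is 0, and we do not need to check whether a key is already present in the dictionary. We then split the text on spaces into a list of words, iterate through the list, and increment the count for each word.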
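The default-value behaviour described above can be checked directly. A short sketch (the key names are illustrative): reading a missing key calls int(), which returns 0, and stores the key, while a membership test with in does not insert anything.

```python
from collections import defaultdict

wordDict = defaultdict(int)
before = len(wordDict)

# Reading a missing key calls int() -> 0 and stores the key
value = wordDict["missing"]
after = len(wordDict)

# A membership test does not create the key
present = "other" in wordDict

print(before, value, after, dict(wordDict))  # 0 0 1 {'missing': 0}
print(present, len(wordDict))                # False 1
```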

Answer 8 (score: 0)

wordList = 'this is the textfile, and it is used to take words and count'.split()
wordFreq = {}

# Logic: word not in the dict, give it a value of 1. if key already present, +1.
for word in wordList:
    if word not in wordFreq:
        wordFreq[word] = 1
    else:
        wordFreq[word] += 1

print(wordFreq)

Answer 9 (score: 0)

Another function:

def wcount(filename):
    counts = dict()
    with open(filename) as file:
        a = file.read().split()
        # words = [b.rstrip() for b in a]
    for word in a:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

Answer 10 (score: 0)

def play_with_words(input):

+

input ="i,am,here,where,u,are"

print(play_with_words(input))

Answer 11 (score: 0)

Write a Python program to create a list of strings by taking input from the user and then create a dictionary containing each string along with its frequency (e.g. if the list is ['apple', 'banana', 'fig', 'apple', 'fig', 'banana', 'grapes', 'fig', 'grapes', 'apple'] then the output should be {'apple': 3, 'banana': 2, 'fig': 3, 'grapes': 2}).

lst = []
d = dict()
print("ENTER ZERO NUMBER FOR EXIT !!!!!!!!!!!!")
while True:
    user = input('enter string element :: -- ')
    if user == "0":
        break
    else:
        lst.append(user)
print("LIST ELEMENTS ARE :: ",lst)
l = len(lst)
for i in range(l) :
    c = 0
    for j in range(l) :
        if lst[i] == lst[j ]:
            c += 1
    d[lst[i]] = c
print("dictionary is  :: ",d)
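The nested loops above compare every element with every other element. As a hedged sketch of a single-pass alternative (using the example list from the task statement in place of user input), dict.get can accumulate the counts directly:

```python
lst = ['apple', 'banana', 'fig', 'apple', 'fig',
       'banana', 'grapes', 'fig', 'grapes', 'apple']

d = {}
for item in lst:
    # Start from 0 for a new string, otherwise add to the stored count
    d[item] = d.get(item, 0) + 1

print("dictionary is  :: ", d)
```

For this list the resulting dictionary is {'apple': 3, 'banana': 2, 'fig': 3, 'grapes': 2}, matching the expected output in the task.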