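Counting the most common capitalized words in a piece of text

Time: 2017-08-05 04:57:40

Tags: python python-3.x

I have a task where I open a text file and count how many times each capitalized word occurs, then print the three most frequent ones. My code works until it gets a text file in which a word is repeated within a line.

Text file 1: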

Jellicle Cats are black and white,
Jellicle Cats are rather small;
Jellicle Cats are merry and bright,
And pleasant to hear when they caterwaul.
Jellicle Cats have cheerful faces,
Jellicle Cats have bright black eyes;
They like to practise their airs and graces
And wait for the Jellicle Moon to rise.

Result:

6 Jellicle
5 Cats
2 And

Text file 2:

Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master, 
One for the dame.
One for the little boy who lives down the lane.

Result:

1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes

Here is my code:

wc = {}
t3 = {}
p = 0
xx=0
a = open('novel.txt').readlines()
for i in a:
  b = i.split()
  for l in b:
    if l[0].isupper():
      if l not in wc:
         wc[l] = 1
      else:
        wc[l] += 1
while p < 3:
  p += 1
  max_val=max(wc.values())
  for words in wc:
    if wc[words] == max_val:
      t3[words] = wc[words]
      wc[words] = 1

    else:
      null = 1
while xx < 3:
  xx+=1
  maxval = max(t3.values())
  for word in sorted(t3):
    if t3[word] == maxval:
      print(t3[word],word)
      t3[word] = 1
    else:
      null+=1

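Please help me fix this. Thanks!

Thank you for all the suggestions. After manually debugging the code and using your replies, I found that while xx < 3: was unnecessary, and that wc[words] = 1 made the program count words again when the third most frequent word occurs only once. Replacing it with wc[words] = 0 let me avoid the counting loop.

Thanks!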
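The repeated lines in the second result can be reproduced without the file. The following is a minimal sketch, assuming the counts that the first loop produces for text file 2 (worked out by hand); the names `counts` and `top3` are illustrative and not part of the code above.

```python
# Counts that the first loop produces for text file 2 (worked out by hand).
wc = {'Baa': 2, 'Yes': 2, 'One': 3}
t3 = {}

for _ in range(3):
    max_val = max(wc.values())
    for word in wc:
        if wc[word] == max_val:
            t3[word] = wc[word]
            wc[word] = 1  # reset used in the original code

# Pass 1 picks One (3), pass 2 picks Baa and Yes (2 each). On pass 3 every
# count is 1, so every word matches the maximum and t3 is overwritten.
print(t3)

# Alternative sketch: sort once and slice, leaving the counts untouched.
# Ties are broken alphabetically here, which is an assumption.
counts = {'Baa': 2, 'Yes': 2, 'One': 3}
top3 = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
for word, count in top3:
    print(count, word)
```

Resetting to 0 avoids the overwrite as long as unpicked words remain. When every word has already been picked before the third pass, as with these three words, the maximum becomes 0 and the same overwrite happens, which is why the sketch ends with a sort-based selection.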

2 answers:

Answer 0 (score: 4)

This is quite simple, but you need a few tools.

  1. re.sub, to get rid of punctuation

  2. filter, to keep only title-case words, using str.istitle

  3. collections.Counter, to count the words (do from collections import Counter first).

Assuming text holds your paragraph (the first one), this works:

    In [296]: Counter(filter(str.istitle, re.sub(r'[^\w\s]', '', text).split())).most_common(3)
    Out[296]: [('Jellicle', 6), ('Cats', 5), ('And', 2)]
    

Counter.most_common(x) returns the x most common words.

Incidentally, this is the output for your second paragraph:

    [('One', 3), ('Baa', 2), ('Yes', 2)]
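For a script rather than an interactive session, the same steps can be wrapped in a function. This is a minimal sketch of that idea; the function name `top_title_words` and the inline `text` string are assumptions made for illustration, not part of the answer.

```python
import re
from collections import Counter


def top_title_words(text, n=3):
    """Return the n most common title-case words as (word, count) pairs."""
    # Drop everything that is neither a word character nor whitespace.
    cleaned = re.sub(r'[^\w\s]', '', text)
    # Keep only title-case words, e.g. 'Cats' but not 'cats' or 'CATS'.
    titled = filter(str.istitle, cleaned.split())
    return Counter(titled).most_common(n)


text = """Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master,
One for the dame.
One for the little boy who lives down the lane."""

for word, count in top_title_words(text):
    print(count, word)
```

Note that str.istitle is stricter than checking word[0].isupper(): an all-uppercase word such as 'CATS' starts with an uppercase letter but is not title case, so the two approaches can give different counts on other inputs.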
    

Answer 1 (score: 0)

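import operator

fname = 'novel.txt'

# Read the whole file and split it into whitespace-separated words.
with open(fname) as fptr:
    words = fptr.read().split()

# Count every word that starts with an uppercase letter.
data = {}
for word in words:
    if word[0].isupper():
        if word in data:
            data[word] = data[word] + 1
        else:
            data[word] = 1

# Keep the three most frequent (word, count) pairs as a list, so the
# sorted order is preserved on every Python 3 version.
valores_ord = sorted(data.items(), key=operator.itemgetter(1), reverse=True)[:3]

for word, count in valores_ord:
    print(count, word)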
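One caveat with splitting on whitespace alone is that punctuation stays attached, so 'One' and 'One,' would be counted as different words. The following is a minimal sketch of one way around that, assuming it is acceptable to strip leading and trailing punctuation with str.strip(string.punctuation); the function name `count_capitalized` and the sample string are illustrative.

```python
import string
from collections import defaultdict


def count_capitalized(text):
    """Count words that start with an uppercase letter, ignoring edge punctuation."""
    counts = defaultdict(int)
    for raw in text.split():
        # Remove punctuation from both ends, e.g. 'One,' becomes 'One'.
        word = raw.strip(string.punctuation)
        # A token made only of punctuation becomes empty, so guard for it.
        if word and word[0].isupper():
            counts[word] += 1
    return dict(counts)


sample = "Baa, Baa! Yes sir. One, One; One"
counts = count_capitalized(sample)
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:3]
for word, count in top:
    print(count, word)
```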