Word frequency in a text file in Python

Asked: 2017-12-26 13:58:54

Tags: python

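I want to find the frequency of certain words listed in wanted. The code does find those frequencies, but the result also contains a lot of unwanted entries.

Code: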

from collections import Counter
import re
wanted = "whereby also thus"
cnt = Counter()
words = re.findall('\w+', open('C:/Users/user/desktop/text.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print (cnt)

Result:

Counter({'e': 131, 'a': 119, 'by': 38, 'where': 16, 's': 14, 'also': 13, 'he': 4, 'whereby': 2, 'al': 2, 'b': 2, 'o': 1, 't': 1})

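Questions:

  1. How can I leave out all the 'e', 'a', 'by', 'where' and so on?
  2. If I want to add up the frequencies of the words (that is, the words in wanted) and divide the sum by the total number of words in the text, is that possible?
  3. Disclaimer: this is not a school assignment. I have a lot of free time at work at the moment, and since I spend much of it reading texts, I decided to do this small project to remind myself of some things I was taught a few years ago.

    Thanks in advance for your help.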

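2 answers:

Answer 0 (score: 1)

As others have pointed out, you need to change the string wanted into a list. With a string, word in wanted is a substring test, which is why fragments such as 'e' and 'by' were counted. I simply hard-coded the list, but if the words arrive as a string passed into a function, you can use str.split(" "). I have also implemented the frequency calculation for you. As a side note, make sure you close the file; using the with open statement is easier (and recommended).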

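from collections import Counter
import re

# A list makes "word in wanted" an exact match instead of a substring test.
wanted = ["whereby", "also", "thus"]
cnt = Counter()
# The with statement closes the file automatically.
with open('C:/Users/user/desktop/text.txt', 'r') as fp:
    fp_contents = fp.read().lower()
words = re.findall(r'\w+', fp_contents)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

total_cnt = sum(cnt.values())

# Question 2: share of the wanted words among all words in the text.
# Divide by len(words); len(cnt) is only the number of distinct wanted words found.
print(total_cnt / len(words))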
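
To make the difference between the string and the list concrete, here is a minimal sketch. The function name `wanted_frequencies` and the sample sentence are made up for illustration and are not part of the original code; it assumes the wanted words arrive as one space-separated string.

```python
from collections import Counter
import re


def wanted_frequencies(text, wanted):
    """Count whole-word matches and each word's share of all words in text."""
    # A set makes `word in wanted_set` an exact, whole-word comparison.
    wanted_set = set(wanted.split())
    words = re.findall(r"\w+", text.lower())
    counts = Counter(word for word in words if word in wanted_set)
    total = len(words)
    # Guard against an empty text to avoid dividing by zero.
    shares = {word: n / total for word, n in counts.items()} if total else {}
    return counts, shares


# With a string, `in` is a substring test; with a set it is exact membership.
print("e" in "whereby also thus")               # True
print("e" in set("whereby also thus".split()))  # False

# Hypothetical sample text, standing in for the contents of text.txt.
sample = "Thus he went, whereby she also went; thus it ended."
counts, shares = wanted_frequencies(sample, "whereby also thus")
print(counts)
print(shares)
```

Returning the per-word shares alongside the raw counts is just one possible design; summing `shares.values()` gives the combined share asked about in question 2.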

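Answer 1 (score: 0)

Reading from the web

I made this small variant of Axel's code that reads a txt file from the web, Alice in Wonderland, to try the code out (I don't have your txt file and wanted to test it). I am posting it here in case anyone needs something like this.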

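from collections import Counter
import re
from urllib.request import urlopen

# Note: str() on the downloaded bytes gives their "b'...'" representation, not decoded text.
testo = str(urlopen("https://www.gutenberg.org/files/11/11.txt").read())
# The text is not lower-cased here, so matching is case-sensitive ("Alice", not "alice").
wanted = ["whereby", "also", "thus", "Alice", "down", "up", "cup"]
cnt = Counter()
words = re.findall(r'\w+', testo)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

total_cnt = sum(cnt.values())

# Mean count per distinct wanted word that was found.
print(float(total_cnt) / len(cnt))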
  

Output

Counter({'Alice': 334, 'up': 97, 'down': 90, 'also': 4, 'cup': 2})
105.4
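
One caveat about `str(urlopen(...).read())`: calling `str()` on a `bytes` object returns its `b'...'` representation, in which line breaks show up as the literal characters `\r\n`, so a word at the start of a line can end up glued to a stray `n`. A minimal sketch of the difference follows, using a small hard-coded `bytes` value (an assumption, standing in for the downloaded file) so that it runs offline:

```python
import re

# Hypothetical stand-in for urlopen(...).read(), which returns bytes.
raw = b"Alice was beginning\r\nAlice thought\r\n"

as_repr = str(raw)             # the "b'...'" representation, with escaped \r\n
as_text = raw.decode("utf-8")  # real text with real line breaks

words_repr = re.findall(r"\w+", as_repr)
words_text = re.findall(r"\w+", as_text)

print(words_repr)
print(words_text)
# Number of exact "Alice" matches with each approach.
print(words_repr.count("Alice"), words_text.count("Alice"))
```

With the real download, the equivalent would be `urlopen(url).read().decode("utf-8")`, assuming the file is UTF-8 encoded; the counts could then differ from the output shown above.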

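Counting how often the same word appears in adjacent sentences

This part answers a request from the question's author: count how many times a word is found in two adjacent sentences. If a sentence contains the same word several times (for example, 'have') and the next sentence contains that word as well, I count it as 1 repetition. That is why I use the wordfound list.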

from collections import Counter
import re


testo = """There was nothing so VERY remarkable in that; nor did Alice think it so? Thanks VERY much. Out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed. Quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS? WAISTCOAT-POCKET, and looked at it, and then hurried on.
Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit. with either a waistcoat-pocket, or a watch to take out of it! and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop? Down a large rabbit-hole under the hedge.
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw. How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, but she could not even get her head through the doorway; 'and even if my head would go through,' thought poor Alice, 'it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! I think I could, if I only knew how to begin.'For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few things indeed were really impossible. There seemed to be no use in waiting by the little door, so she went back to the table, half hoping she might find another key on it, or at any rate a book of rules for shutting people up like telescopes: this time she found a little bottle on it, ('which certainly was not here before,' said Alice,) and round the neck of the bottle was a paper label, with the words 'DRINK ME' beautifully printed on it in large letters. It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry. 'No, I'll look first,' she said, 'and see whether it's marked "poison" or not'; for she had read several nice little histories about children who had got burnt, and eaten up by wild beasts and other unpleasant things, all because they WOULD not remember the simple rules their friends had taught them: such as, that a red-hot poker will burn you if you hold it too long; and that if you cut your finger VERY deeply with a knife, it usually bleeds; and she had never forgotten that, if you drink much from a bottle marked 'poison,' it is almost certain to disagree with you, sooner or later. 
However, this bottle was NOT marked 'poison,' so Alice ventured to taste it, and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, custard, pine-apple, roast turkey, toffee, and hot buttered toast,) she very soon finished it off. """


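# Split the text into rough sentences: from a capital letter to the next '.', '!' or '?'.
frasi = re.findall(r"[A-Z].*?[.!?]", testo, re.MULTILINE | re.DOTALL)

print("How many times this words are repeated in adjacent sentences:")
cnt2 = Counter()
for n, s in enumerate(frasi):
    words = re.findall(r"\w+", s)
    wordfound = []  # words of this sentence already found in the next one
    for word in words:
        try:
            # frasi[n + 1] is a string, so this is a substring test.
            if word in frasi[n + 1]:
                wordfound.append(word)
                # Count each word at most once per pair of sentences.
                if wordfound.count(word) < 2:
                    cnt2[word] += 1
        except IndexError:
            # The last sentence has no following sentence.
            pass
for k, v in cnt2.items():
    print(k, v)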
  

Output

How many times this words are repeated in adjacent sentences:
had 1
hole 1
or 1
as 1
little 2
that 1
hot 1
large 1
it 5
to 5
a 6
not 3
and 2
s 1
me 1
bottle 1
is 1
no 1
the 6
how 1
Oh 1
she 2
at 1
marked 1
think 1
VERY 1
I 2
door 1
red 1
of 1
dear 1
see 1
could 2
in 2
so 1
was 1
poison 1
A 1
Alice 3
all 1
nice 1
rabbit 1
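
Note that `word in frasi[n + 1]` tests against a string, so it is again a substring match, which is probably how fragments such as `s` end up in the output above. A hedged sketch of a whole-word variant follows; the function name `adjacent_repeats` and the sample text are invented for illustration, and comparing in lower case is an added assumption rather than something the original code does.

```python
from collections import Counter
import re


def adjacent_repeats(text):
    """Count, per word, how many adjacent sentence pairs share that word."""
    sentences = re.findall(r"[A-Z].*?[.!?]", text, re.DOTALL)
    # One set of lower-cased words per sentence: duplicates inside a
    # sentence collapse, which replaces the wordfound bookkeeping.
    word_sets = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    counts = Counter()
    # zip pairs each sentence with the next one and stops before the end,
    # so no IndexError handling is needed.
    for current, following in zip(word_sets, word_sets[1:]):
        counts.update(current & following)
    return counts


# Hypothetical sample text; "hat" contains "a" but is not counted as "a".
sample = "The cat sat. The cat ran to a hat. A dog ran."
result = adjacent_repeats(sample)
for word, n in sorted(result.items()):
    print(word, n)
```

Because the comparison is between sets of whole words, the numbers from this variant would not be expected to match the output listed above.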