I have a large text file like this example.txt:
http://www.fullbooks.com/The-Jacket-Star-Rover-1.html
With awk,
the following pipeline outputs the top 20 most common sequences of three consecutive words:
cat example.txt | awk '{ print substr($0, index($0,$3)) }' | tr -sc "[A-Z][a-z][0-9]'" '[\012*]' | awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | sort | uniq -c | sort -nr | head -n20
It starts with:
13 in the jacket
11 I was a
10 of the Yard
10 me in the
8 Captain of the
7 times and places
7 the Captain of
7 in the prison
7 in the dungeons
7 in San Quentin
7 I had been
6 other times and
6 hours in the
6 are going to
5 twenty four hours
5 to take me
5 the rest of
5 the forty lifers
5 the Board of
5 that I had
How can I get the same result with python3?
Answer 0 (score: 3)
This is a nice variant of counting word frequencies, but not all that different. I would use collections.Counter with tuple keys (tuples are hashable, so they can serve as dictionary keys):

import collections

with open('example.txt') as raw:
    words = raw.read().split()

# count overlapping 3-word sequences; range(len(words) - 2) includes the last triplet
c = collections.Counter(tuple(words[i:i+3]) for i in range(len(words) - 2))
for x in sorted([(k, v) for k, v in c.items() if v >= 5], key=lambda x: x[1], reverse=True):
    print(x)
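Counter also provides most_common, which directly replaces the manual filter-and-sort (the `sort | uniq -c | sort -nr | head -n20` steps of the awk pipeline). A minimal sketch, using a tiny inline word list standing in for example.txt:

```python
import collections

# Tiny illustrative corpus instead of reading example.txt
words = "the cat sat on the mat and the cat sat down".split()

# Count overlapping 3-word sequences (trigrams)
c = collections.Counter(tuple(words[i:i+3]) for i in range(len(words) - 2))

# most_common(20) returns the 20 highest counts, like `sort -nr | head -n20`
for trigram, count in c.most_common(20):
    print(count, *trigram)
```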
Note that splitting with str.split() does not cope well with punctuation (e.g. "Hello, World" is split into "Hello," and "World"), so we are better off splitting on non-alphanumeric characters with a regular expression:

import re
words = [x for x in re.split(r"\W", raw.read()) if x]
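To see the difference, compare the two splitting strategies on a small punctuated string (an illustrative example, not taken from the file):

```python
import re

text = "Hello, World. Hello again!"

# Naive whitespace split keeps punctuation attached to the words
print(text.split())  # ['Hello,', 'World.', 'Hello', 'again!']

# Splitting on non-alphanumeric characters strips the punctuation
words = [x for x in re.split(r"\W", text) if x]
print(words)  # ['Hello', 'World', 'Hello', 'again']
```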
With that I get these results (with higher counts than the naive str.split gives):
(('in', 'the', 'jacket'), 19)
(('of', 'the', 'Yard'), 13)
(('Captain', 'of', 'the'), 12)
(('I', 'was', 'a'), 12)
(('me', 'in', 'the'), 11)
(('in', 'the', 'prison'), 11)
(('in', 'the', 'dungeons'), 10)
(('hours', 'in', 'the'), 9)
(('in', 'San', 'Quentin'), 9)
(('I', 'don', 't'), 8)
(('He', 'was', 'a'), 8)
(('are', 'going', 'to'), 8)
(('I', 'had', 'been'), 7)
(('I', 'have', 'been'), 7)
(('in', 'order', 'to'), 7)
(('times', 'and', 'places'), 7)
(('five', 'pounds', 'of'), 7)
(('and', 'I', 'have'), 7)
(('the', 'Captain', 'of'), 7)
(('Darrell', 'Standing', 's'), 6)
(('I', 'did', 'not'), 6)
(('five', 'years', 'of'), 6)
(('Warden', 'Atherton', 'and'), 6)
(('Board', 'of', 'Directors'), 6)
(('thirty', 'five', 'pounds'), 6)
(('that', 'I', 'had'), 6)
(('pounds', 'of', 'dynamite'), 6)
(('other', 'times', 'and'), 6)
(('of', 'San', 'Quentin'), 5)
(('the', 'forty', 'lifers'), 5)
(('and', 'Captain', 'Jamie'), 5)
(('I', 'Darrell', 'Standing'), 5)
(('in', 'the', 'dungeon'), 5)
(('going', 'to', 'take'), 5)
...
Alternatively, we could get different results by converting the words to lowercase, so that sentence-initial words are merged with their lowercase forms ("in the woods" vs. "In the woods").
Answer 1 (score: 0)
You can try this straightforward implementation:
import re

frequency = {}
with open('example.txt') as raw:
    words = [word.lower() for word in re.split(r"\W", raw.read()) if word]
    for index, word in enumerate(words):
        if index < (len(words) - 2):
            triplet = (word, words[index + 1], words[index + 2])
            if triplet in frequency:
                frequency[triplet] += 1
            else:
                frequency[triplet] = 1

for triplet, rank in frequency.items():
    print(triplet, str(rank))
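Note that the final loop prints the triplets in arbitrary order. To reproduce the top-20 behaviour of the awk pipeline, sort the dict items by count before printing. A sketch using the same dict shape, with a few counts from the results above standing in for the full file:

```python
# Hypothetical frequency dict, shaped like the one built above
frequency = {
    ("in", "the", "jacket"): 19,
    ("of", "the", "Yard"): 13,
    ("I", "was", "a"): 12,
}

# Sort by descending count and keep the first 20, like `sort -nr | head -n20`
for triplet, rank in sorted(frequency.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(rank, " ".join(triplet))
```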