How to find word sequences in Python?

Time: 2017-10-26 12:55:16

Tags: python python-3.x

I have a large text file like this example.txt:
http://www.fullbooks.com/The-Jacket-Star-Rover-1.html
With awk:

cat example.txt | awk '{ print substr($0, index($0,$3)) }' | tr -sc "[A-Z][a-z][0-9]'" '[\012*]' | awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | sort | uniq -c | sort -nr | head -n20

The output is the top 20 most common sequences of three consecutive words:

 13 in the jacket
 11 I was a
 10 of the Yard
 10 me in the
  8 Captain of the
  7 times and places
  7 the Captain of
  7 in the prison
  7 in the dungeons
  7 in San Quentin
  7 I had been
  6 other times and
  6 hours in the
  6 are going to
  5 twenty four hours
  5 to take me
  5 the rest of
  5 the forty lifers
  5 the Board of
  5 that I had

How can I get the same result with python3?

2 answers:

Answer 0: (score: 3)

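This is a nice variation on counting word frequencies, but not that different. I would:

  • read the file and split it like you did
  • create triplets and feed them to collections.Counter (using the tuple type so that they are hashable)
  • filter/sort to display the triplets with 5 or more occurrences
Like this: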

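import collections

with open('example.txt') as raw:
    words = raw.read().split()

# Count every run of three consecutive words; tuples are hashable, so they work as Counter keys.
# range(len(words)-2) makes sure the last triplet is included.
c = collections.Counter(tuple(words[i:i+3]) for i in range(len(words)-2))
# Show the triplets seen at least 5 times, most frequent first
for x in sorted([(k,v) for k,v in c.items() if v>=5], key=lambda x: x[1], reverse=True):
    print(x)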

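Note that splitting with str.split() does not work well with punctuation (for example, "Hello, World" is split into "Hello," and "World"), so it is better to split on non-alphanumeric characters with a regular expression: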

# needs "import re" at the top of the script; the raw string avoids an invalid-escape warning
words = [x for x in re.split(r"\W",raw.read()) if x]

I got this result (with more occurrences than with the naive str.split):

(('in', 'the', 'jacket'), 19)
(('of', 'the', 'Yard'), 13)
(('Captain', 'of', 'the'), 12)
(('I', 'was', 'a'), 12)
(('me', 'in', 'the'), 11)
(('in', 'the', 'prison'), 11)
(('in', 'the', 'dungeons'), 10)
(('hours', 'in', 'the'), 9)
(('in', 'San', 'Quentin'), 9)
(('I', 'don', 't'), 8)
(('He', 'was', 'a'), 8)
(('are', 'going', 'to'), 8)
(('I', 'had', 'been'), 7)
(('I', 'have', 'been'), 7)
(('in', 'order', 'to'), 7)
(('times', 'and', 'places'), 7)
(('five', 'pounds', 'of'), 7)
(('and', 'I', 'have'), 7)
(('the', 'Captain', 'of'), 7)
(('Darrell', 'Standing', 's'), 6)
(('I', 'did', 'not'), 6)
(('five', 'years', 'of'), 6)
(('Warden', 'Atherton', 'and'), 6)
(('Board', 'of', 'Directors'), 6)
(('thirty', 'five', 'pounds'), 6)
(('that', 'I', 'had'), 6)
(('pounds', 'of', 'dynamite'), 6)
(('other', 'times', 'and'), 6)
(('of', 'San', 'Quentin'), 5)
(('the', 'forty', 'lifers'), 5)
(('and', 'Captain', 'Jamie'), 5)
(('I', 'Darrell', 'Standing'), 5)
(('in', 'the', 'dungeon'), 5)
(('going', 'to', 'take'), 5)
...
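
The entries ('I', 'don', 't') and ('Darrell', 'Standing', 's') in the output above appear because \W treats the apostrophe as a separator, whereas the tr step in the question keeps apostrophes inside words. A minimal sketch of a tokenizer that behaves more like that tr character class, assuming the text uses plain ASCII apostrophes (the name tokenize is only for illustration):

```python
import re

def tokenize(text):
    # Keep runs of ASCII letters, digits and apostrophes together,
    # mirroring the character class passed to tr in the question.
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("I don't know, said Darrell Standing's friend."))
```

Typographic apostrophes or accented letters would still be treated as separators here, so the counts may differ slightly from the awk pipeline on some texts.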

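Alternatively, we can get a different result by converting the words to lowercase, so that words at the start of a sentence are merged with the rest ("in the woods" vs "In the woods").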
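
A rough sketch of that lowercase variant, which also uses Counter.most_common to return only the top entries, as head -n20 does in the question. The function name top_triplets and the sample sentence are made up for illustration:

```python
import collections
import re

def top_triplets(text, n=20):
    # Lowercase first so "In the woods" and "in the woods" are counted together.
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    # Zipping three staggered slices yields every run of three consecutive words.
    triplets = zip(words, words[1:], words[2:])
    return collections.Counter(triplets).most_common(n)

sample = "In the woods we walked. Later, in the woods, we rested."
for triplet, count in top_triplets(sample, 3):
    print(count, " ".join(triplet))
```

For the real file, the text could be read with open('example.txt') and passed to the function in place of the sample string.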

Answer 1: (score: 0)

You can try this simple implementation:

import re

frequency={}
with open('example.txt') as raw:
    # Raw string avoids an invalid-escape warning for \W
    words = [word.lower() for word in re.split(r"\W",raw.read()) if word]

for index, word in enumerate(words):
    if index < (len(words)-2):
        triplet = (word, words[index+1], words[index+2])
        if triplet in frequency:
            frequency[triplet] += 1
        else:
            frequency[triplet] = 1

for triplet, rank in frequency.items():
    print(triplet,str(rank))
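
The final loop prints every triplet in the order it was first seen. To get a ranked top 20 closer to the awk output, one option is to sort the dictionary items by count before printing. A minimal sketch; the helper name top_n and the small frequency dictionary are hypothetical stand-ins for the dictionary built above:

```python
def top_n(frequency, n=20):
    # Sort (triplet, count) pairs by count, highest first, and keep the first n.
    ranked = sorted(frequency.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

# Hypothetical counts, standing in for the dictionary built from the file.
frequency = {
    ("in", "the", "jacket"): 19,
    ("i", "was", "a"): 12,
    ("of", "the", "yard"): 13,
}

for triplet, count in top_n(frequency):
    # Right-align the count, similar to the layout of uniq -c.
    print(f"{count:4d} {' '.join(triplet)}")
```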