Python basics: counting the different word types in a txt file

Time: 2016-01-31 23:53:09

Tags: python

Can anyone help me with this problem? I want to count the different types of words in a text file.

import sys
import re
import string

pattern = re.compile("^[a-z][a-z0-9]*$")
with open('alice.txt','r') as f:
    for line in f:
        for word in line.split():
            lword = word.lower()
            if pattern.match(lword):
                if len(lword) >= 10:
                    print "Extralong:",'%s%s%d' % (lword, "\t", 1)
                elif len(lword) in [7, 8, 9]:
                    print "Long:",'%s%s%d' % (lword, "\t", 1)
                elif len(lword) in [5, 6]:
                    print "Medium:",'%s%s%d' % (lword, "\t", 1)
                elif len(lword) in [1] and lword in "aeiou":
                    print "Vowel",'%s%s%d' % (lword, "\t", 1)
                else:
                    print "Small:"'%s%s%d' % (lword, "\t", 1)

Output:

Small:the   1
Long: project   1
Long: gutenberg 1
Medium: ebook   1
Small:of    1
Medium: alice   1
Small:in    1
Small:by    1
Medium: lewis   1
Long: carroll   1
Small:this  1
Medium: ebook   1
Small:is    1
Small:for   1
Small:the   1
Small:use   1

I want to get the total for each type, e.g. Small: 5, Long: 3, Medium: 3 ...

2 answers:

Answer 0 (score: 0)

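I would count everything and then merge, but another alternative is bisect, using the highest value of each group as a key to see where a length falls: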

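from collections import defaultdict
from bisect import bisect_left

d = defaultdict(int)
keys = [1, 4, 6, 9]  # highest length in each group
with open("in.txt") as f:
    for line in f:
        # length of every word on the line
        for ln in map(len, line.split()):
            ind = bisect_left(keys, ln)
            # if ln is between 1 and 9, ind will be between 0 and 3;
            # lengths above 9 give ind == 4 and are not counted
            if ind < len(keys):
                d[keys[ind]] += 1

print(d)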

Each time we bisect, we find where the length would fall in the sorted list:

In [13]: keys = [1, 4, 6, 9]
In [14]: bisect_left(keys, 1)
Out[14]: 0
# range 2-4
In [15]: bisect_left(keys, 3)
Out[15]: 1
# range 2-4
In [16]: bisect_left(keys, 4)
Out[16]: 1
# range 5-6
In [17]: bisect_left(keys, 5)
Out[17]: 2
# range 7-9
In [18]: bisect_left(keys, 7)
Out[18]: 3
# range 7-9
In [19]: bisect_left(keys, 9)
Out[19]: 3
# > 9 
In [20]: bisect_left(keys, 10)
Out[20]: 4

The logic is somewhat similar to the grade example function in the bisect docs:

from bisect import bisect

def grade(score, breakpoints=[60, 70, 80, 90], grades='FDCBA'):
    i = bisect(breakpoints, score)
    return grades[i]
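The same idea can be turned into named categories, much as grade does. This is only a sketch: the names BREAKPOINTS, LABELS, length_category and count_by_length are made up for illustration, and the "Single" label for one-character words is an assumption (the question splits those into vowels and others instead).

```python
from bisect import bisect_left
from collections import Counter

# Highest length in each group, as in the keys list above.
BREAKPOINTS = [1, 4, 6, 9]
# One label per group, plus a final one for lengths above 9.
# "Single" is a hypothetical label for one-character words.
LABELS = ["Single", "Small", "Medium", "Long", "Extralong"]

def length_category(word):
    """Map a word to a label by bisecting on its length."""
    return LABELS[bisect_left(BREAKPOINTS, len(word))]

def count_by_length(words):
    """Count how many words fall into each length group."""
    return Counter(length_category(w) for w in words)

counts = count_by_length(["the", "project", "gutenberg", "ebook", "a", "wonderland"])
print(dict(counts))
```

Unlike the loop above, lengths over 9 get their own label here instead of being skipped.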

Answer 1 (score: 0)

In Python 2/3, Counter from the collections module can help count each item:

import re
from collections import Counter

words = []
pattern = re.compile("^[a-z][a-z0-9]*$")

with open('alice.txt', 'r') as f:
    for line in f:
        for word in line.split():
            lword = word.lower()
            if pattern.match(lword):
                if len(lword) >= 10:
                    words.append("Extralong")
                elif len(lword) in [7, 8, 9]:
                    words.append("Long")
                elif len(lword) in [5, 6]:
                    words.append("Medium")
                elif len(lword) in [2, 3, 4]:
                    words.append("Small")
                elif len(lword) == 1 and lword in "aeiou":
                    words.append("Vowel")
                else:  # remaining case: a single character that is not a vowel
                    words.append("Nothing")

# print(...) with a single argument works in both Python 2 and 3
print(dict(Counter(words)))

Consider the following:

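  • "Nothing" cannot be reached by an empty word, because the regex only matches non-empty words; only a single character that is not a vowel can end up there;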

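  • word.lower() is needed only if words containing capital letters should be counted: the regex matches lowercase letters and digits only, so without it a word such as "The" is skipped.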
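A quick way to check how the pattern treats case is to match a few words with and without lowering them. This is only an illustrative sketch; the word list is made up.

```python
import re

pattern = re.compile(r"^[a-z][a-z0-9]*$")

# Compare matching on the raw word with matching on the lowercased word.
for word in ["the", "The", "ebook", "EBook", "x1", "1x", "don't"]:
    raw = pattern.match(word) is not None
    lowered = pattern.match(word.lower()) is not None
    print("%-6s raw=%s lowered=%s" % (word, raw, lowered))
```

Words that start with a digit or contain punctuation fail to match either way.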

The simplified code could be:

from re import match
from collections import Counter

with open('alice.txt', 'r') as f:
    # Words containing uppercase letters do not match the pattern and are skipped.
    words = [(len(word) >= 10 and 'Extralong') or (len(word) >= 7 and 'Long') or
             (len(word) >= 5 and 'Medium') or (len(word) >= 2 and 'Small') or
             (word in 'aeiou' and 'Vowel') or 'Nothing'  # single non-vowel character
             for word in f.read().split() if match(r'^[a-z][a-z0-9]*$', word)]

print(dict(Counter(words)))

The output is:

{'Small': 9, 'Medium': 4, 'Long': 3}
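To check these totals without the file, the classification can be moved into a small function and run on the sixteen words listed in the question's output. This is a sketch, not part of the original answer: the names classify and count_categories are hypothetical, and the rules follow the question's code (a single vowel is "Vowel", anything else up to four characters is "Small").

```python
import re
from collections import Counter

PATTERN = re.compile(r"^[a-z][a-z0-9]*$")

def classify(word):
    """Return the category of a lowercase word, using the question's rules."""
    n = len(word)
    if n >= 10:
        return "Extralong"
    if n >= 7:
        return "Long"
    if n >= 5:
        return "Medium"
    if n == 1 and word in "aeiou":
        return "Vowel"
    return "Small"

def count_categories(text):
    """Lowercase each word, keep those matching PATTERN, and count categories."""
    words = (w.lower() for w in text.split())
    return Counter(classify(w) for w in words if PATTERN.match(w))

# The words from the question's sample output, in order.
sample = ("the project gutenberg ebook of alice in by lewis carroll "
          "this ebook is for the use")
print(dict(count_categories(sample)))
```

For this sample the counts are Small 9, Medium 4 and Long 3, which agrees with the output above.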