Counting characters and words in Python

Time: 2015-03-30 04:47:55

Tags: python string count character counter

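I'm a beginner in Python and would like to know how to count the characters in two txt files and find the 10 most common characters. I'd also like to know how to convert all characters in the files to lowercase and remove everything except a-z.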

Here is what I have tried, without any luck:

from string import ascii_lowercase
from collections import Counter

with open ('document1.txt' , 'document2.txt') as f:
    print Counter(letter for line in f
                    for letter in line.lower()
                    if letter in ascii_lowercase)

3 answers:

Answer 0 (score: 2)

Here is a simple example. You can adapt this code to fit your needs.

from string import ascii_lowercase
from collections import Counter

with open('file1.txt', 'r') as file1data:  # open and read file one
    file1 = file1data.read().lower()  # convert the entire file contents to lowercase

with open('file2.txt', 'r') as file2data:  # open and read file two
    file2 = file2data.read().lower()

# The contents of files 1 and 2 are now stored in the file1 and file2 variables.
# Example of how to work with one file; repeat for the second file.
file1_list = []
for ch in file1:
    if ch in ascii_lowercase:  # only lowercase letters are appended; all non-alphabetic characters are dropped
        file1_list.append(ch)
    elif ch in [" ", ".", ",", "'"]:  # remove this elif block if you just want the letters
        file1_list.append(ch)  # keeps basic punctuation

print("".join(file1_list))  # this line is not needed; it just shows what the text looks like now
print(Counter(file1_list).most_common(10))  # prints the top ten
print(Counter(file1_list))  # prints every character and how many times it occurs

Now that you have gone through the mess above and know what each line does, here is a cleaner version that does what you need.

from string import ascii_lowercase
from collections import Counter

with open('file1.txt', 'r') as file1data: 
    file1 = file1data.read().lower()

with open('file2.txt', 'r') as file2data: 
    file2 = file2data.read().lower() 

file1_list = []
for ch in file1:
    if ch in ascii_lowercase: 
        file1_list.append(ch)

file2_list = []
for ch in file2:
    if ch in ascii_lowercase: 
        file2_list.append(ch)



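all_counter = Counter(file1_list + file2_list)
top_ten_counter = all_counter.most_common(10)  # reuse the counter instead of rebuilding it

print(sorted(all_counter.items()))  # every letter with its count, in alphabetical order
print(sorted(top_ten_counter))  # the ten most common letters, in alphabetical order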
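
The two loops above do the same work for each file, so they can be folded into one helper. What follows is only a minimal sketch of that idea, not the answer's own code: the function name count_letters is hypothetical, and it assumes the files are plain text that can be read line by line.

```python
from collections import Counter
from string import ascii_lowercase


def count_letters(paths):
    """Count the letters a-z across all files in `paths` (hypothetical helper)."""
    allowed = set(ascii_lowercase)
    counts = Counter()
    for path in paths:
        with open(path) as f:
            for line in f:
                # Lowercase first, then keep only a-z; everything else is dropped.
                counts.update(ch for ch in line.lower() if ch in allowed)
    return counts
```

Calling count_letters(['file1.txt', 'file2.txt']).most_common(10) would then give the ten most common letters across both files, ordered by frequency.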

Answer 1 (score: 2)

Try it like this:

>>> from collections import Counter
>>> import re
>>> words = re.findall(r'\w+', "{} {}".format(open('your_file1').read().lower(), open('your_file2').read().lower()))
>>> Counter(words).most_common(10)
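
Note that the pattern r'\w+' extracts whole words (and \w also matches digits and the underscore), so the snippet above counts words rather than characters. For the character-level part of the question, a regular expression can also do the cleaning. This is a rough sketch under that assumption; clean_text and top_letters are hypothetical names, not part of the answer.

```python
import re
from collections import Counter


def clean_text(text):
    """Lowercase `text` and drop every character that is not a-z."""
    return re.sub(r'[^a-z]', '', text.lower())


def top_letters(texts, n=10):
    """Return the n most common letters across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(clean_text(text))  # a str updates the counter per character
    return counts.most_common(n)
```

The strings passed to top_letters could be the results of read() on each file.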

Answer 2 (score: 1)

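Unfortunately, there is no way to insert into the middle of a file without rewriting it. As the previous posters have indicated, you can append to a file or overwrite part of it using seek, but if you want to add something at the beginning or in the middle, you have to rewrite it.

This is an operating system thing, not a Python thing. It is the same in all languages.

What I usually do is read from the file, make the modifications, and write the result to a new file called myfile.txt.tmp or something similar. This is better than reading the whole file into memory, because the file may be too large for that. Once the temporary file is complete, I rename it to the original file's name.

This is a good, safe way to do it, because if the write crashes or aborts for any reason, you still have the untouched original file.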
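
The read, modify, write, rename routine described above might look roughly like this. It is a sketch only: rewrite_file and its transform argument are hypothetical names, and os.replace assumes Python 3.3 or later.

```python
import os


def rewrite_file(path, transform):
    """Rewrite `path` line by line through a temporary file (hypothetical helper).

    `transform` is any function that maps one input line to one output string.
    """
    tmp_path = path + '.tmp'  # same naming idea as myfile.txt.tmp
    with open(path) as src, open(tmp_path, 'w') as dst:
        for line in src:  # streams the file instead of loading it all into memory
            dst.write(transform(line))
    # Only after the temporary file is complete does it replace the original.
    os.replace(tmp_path, path)
```

For example, rewrite_file('myfile.txt', str.lower) would lowercase the file in place; if the loop fails partway through, myfile.txt itself has not been touched.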

To find the most common words across multiple files:

from collections import Counter
import re
with open('document1.txt') as f1, open('document2.txt') as f2:
    words = re.findall(r'\w+', f1.read().lower()) + re.findall(r'\w+', f2.read().lower())

print(Counter(words).most_common(10))  # the 10 most common words

If you want the 10 most common characters:

>>> Counter(f1.read() + f2.read()).most_common(10)  # f1, f2: open files that have not been read yet