Question

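I have a script that reads from a file and takes the first two characters of each word in a line. What I want to do is find the two-letter pairs that occur most often. Do I have to convert my output to a list and do it that way?

Here is my code: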

#!/usr/bin/python

import string
import re
import random
import sys


file = raw_input("Enter path to filename :")

text_file= open(file,'r')
data=text_file.readlines()
firsttwo =[]
lines = []

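def first2():
    # Print the first two characters of every line read from the file.
    for line in data:
        firsttwo = line[:2]
        print firsttwo

# first2() prints as it goes and returns nothing, so call it without print.
first2()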

Answer 1

You can use Counter to count how often each item occurs in a list.

from collections import Counter

text_file= open("C:/test.txt",'r')
firsttwo = [line[:2] for line in text_file.readlines()]

print Counter(firsttwo)

If the contents of test.txt are:

first line
second line
second line
third line

the code above outputs:

Counter({'se': 2, 'fi': 1, 'th': 1})

If you want to convert this output to a list, you can do the following:

list(Counter(firsttwo).items())

Output:

[('fi', 1), ('th', 1), ('se', 2)]
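
To get the most frequent pair directly, Counter also provides a most_common() method that returns (item, count) pairs ordered by count. A minimal sketch, written for Python 3 and assuming the sample lines shown above are held in a list (the name sample_lines is illustrative) rather than read from C:/test.txt:

```python
from collections import Counter

# Illustrative stand-in for the lines of test.txt shown above.
sample_lines = ["first line\n", "second line\n", "second line\n", "third line\n"]

# First two characters of every line, as in the answer's code.
firsttwo = [line[:2] for line in sample_lines]

# most_common(1) returns a one-element list holding the (item, count)
# pair with the highest count.
pair, count = Counter(firsttwo).most_common(1)[0]
print(pair, count)  # se 2
```

Passing a larger number, for example most_common(3), would return the three most frequent pairs in descending order of count.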

Edit (without collections):

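text_file= open("C:/test.txt",'r')
firsttwo = [line[:2] for line in text_file.readlines()]
# Distinct prefixes, then (count, prefix) pairs so that sorting orders by count first.
l_items = set(firsttwo)
l_counts = [(firsttwo.count(x), x) for x in l_items]
l_counts.sort(reverse=True)
# The first entry now holds the highest count; print its prefix.
print l_counts[0][1]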
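
Calling list.count() once per distinct prefix rescans the whole list each time. As an alternative that still avoids collections, the counts can be built in a single pass with a plain dictionary. This is only a sketch, written for Python 3; the names sample_lines and counts are illustrative, and the sample data stands in for the file contents:

```python
# Illustrative stand-in for the lines of test.txt.
sample_lines = ["first line\n", "second line\n", "second line\n", "third line\n"]
firsttwo = [line[:2] for line in sample_lines]

# Build the counts in one pass over the list.
counts = {}
for pair in firsttwo:
    counts[pair] = counts.get(pair, 0) + 1

# max() with key=counts.get picks the key whose count is largest.
most_frequent = max(counts, key=counts.get)
print(most_frequent)  # se
```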

Answer 2

To build the initial string, use a generator expression and join():

In [49]: mystring="".join(line[:2] for line in data)

Counting can then be done with the count() method of str objects:

In [50]: mystring="helloworld"

In [51]: mystring.count("o")
Out[51]: 2

If you want the most common items, use sorted and string.ascii_letters:

In [52]: from string import ascii_letters as letters
In [71]: mystring
Out[71]: "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. "

In [72]: sorted((mystring.count(l),l) for l in letters)[:-5:-1]
Out[72]: [(23, 'e'), (20, 't'), (17, 'n'), (14, 's')]
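
The sorted() approach above calls count() once for each of the 52 letters. A similar ranking can be sketched in a single pass with collections.Counter, which counts the characters of a string when given an iterable of them. This is an alternative to the answer's method rather than a restatement of it, written for Python 3 and shown on the short "helloworld" string from earlier:

```python
from collections import Counter
from string import ascii_letters

mystring = "helloworld"

# Count only the characters that are ASCII letters.
letter_counts = Counter(ch for ch in mystring if ch in ascii_letters)

# The two most frequent letters as (letter, count) pairs.
top = letter_counts.most_common(2)
print(top)  # [('l', 3), ('o', 2)]
```

Note that the pairs come back as (letter, count) rather than (count, letter), and that letters with equal counts may be ordered differently than in the sorted() version.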

Answer 3

This is how I would do it:

import re
from collections import Counter

my_file = open("text.txt", 'r')
# Read the whole file as one string. Calling str() on the list returned by
# readlines() would turn each newline into a literal "\n", and its "n"
# would then be matched as a separate word.
text = my_file.read()
my_file.close()

# First two letters of every word, upper-cased so that counting ignores case.
processed_letters = [item[:2].upper() for item in re.findall(r"\w+", text)]

resulting_count = Counter(processed_letters)

print resulting_count

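This may not be the best way, but it does the following:

Reads the file
Stores the first two letters of each word
Uses a collections Counter to count each group of letters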
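
The steps above can be sketched as a small reusable function. This is a minimal illustration written for Python 3, not the answer's exact code: the function name count_word_prefixes, its length parameter, and the sample text are all assumptions made for the example.

```python
import re
from collections import Counter


def count_word_prefixes(text, length=2):
    """Count the first `length` characters of every word, ignoring case."""
    words = re.findall(r"\w+", text)
    return Counter(word[:length].upper() for word in words)


# Illustrative text standing in for the contents of the file.
sample_text = "first line\nsecond line\nsecond line\nthird line\n"

prefix_counts = count_word_prefixes(sample_text)
print(prefix_counts.most_common(1))  # [('LI', 4)]
```

Because every word is counted rather than only the start of each line, "LI" (from "line") comes out ahead of "SE" here.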

Find the number of occurrences of selected letters

3 answers: