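I am looking for a Python program that counts the frequency of each word in a text and outputs each word with its count and the numbers of the lines where it appears.
We define a word as a contiguous sequence of non-whitespace characters. (Hint: split())
Note: different capitalizations of the same character sequence should be considered the same word, e.g. Python and python, I and i.
The input will be several lines, with an empty line terminating the text. Only alphabetical characters and spaces are present in the input.
The output is formatted as follows:
Each line begins with a number indicating the frequency of the word, then a space, then the word itself, then a list of the line numbers containing that word.
Sample input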
Python is a cool language but OCaml
is even cooler since it is purely functional
Sample output
3 is 1 2
1 a 1
1 but 1
1 cool 1
1 cooler 2
1 even 2
1 functional 2
1 it 2
1 language 1
1 ocaml 1
1 purely 2
1 python 1
1 since 2
P.S. I am not a student; I am learning Python on my own.
Answer 0 (score: 5)
Use collections.defaultdict, collections.Counter and string formatting:
from collections import Counter, defaultdict
data = """Python is a cool language but OCaml
is even cooler since it is purely functional"""
result = defaultdict(lambda: [0, []])
for i, l in enumerate(data.splitlines()):
    for k, v in Counter(l.split()).items():
        result[k][0] += v
        result[k][1].append(i+1)
for k, v in result.items():
    print('{1} {0} {2}'.format(k, *v))
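Output (the order follows the dictionary's iteration order, so it may differ on your machine):
1 since [2]
3 is [1, 2]
1 a [1]
1 it [2]
1 but [1]
1 purely [2]
1 cooler [2]
1 functional [2]
1 Python [1]
1 cool [1]
1 language [1]
1 even [2]
1 OCaml [1]
If the order matters, you can sort the result like this: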
items = sorted(result.items(), key=lambda t: (-t[1][0], t[0].lower()))
for k, v in items:
    print('{1} {0} {2}'.format(k, *v))
Output:
3 is [1, 2]
1 a [1]
1 but [1]
1 cool [1]
1 cooler [2]
1 even [2]
1 functional [2]
1 it [2]
1 language [1]
1 OCaml [1]
1 purely [2]
1 Python [1]
1 since [2]
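The output above keeps Python and OCaml capitalized and prints the line numbers as a Python list, while the question asks for lowercase words and space-separated line numbers. A minimal sketch of such a variant follows. The function name word_report is made up for illustration, and it assumes the whole text is already held in one string.

```python
from collections import defaultdict

def word_report(text):
    """Return one 'count word line-numbers' string per distinct word."""
    counts = defaultdict(int)
    lines_seen = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in line.lower().split():
            counts[word] += 1
            # record each line number only once per word
            if not lines_seen[word] or lines_seen[word][-1] != lineno:
                lines_seen[word].append(lineno)
    # most frequent first, ties broken alphabetically
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ['{} {} {}'.format(counts[w], w, ' '.join(map(str, lines_seen[w])))
            for w in ordered]

data = """Python is a cool language but OCaml
is even cooler since it is purely functional"""
rows = word_report(data)
for row in rows:
    print(row)
```

Lowercasing each line before splitting is what merges Python and python into one entry.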
Answer 1 (score: 1)
Frequency tabulation is usually best solved with a Counter.
from collections import Counter
word_count = Counter()
with open('input', 'r') as f:
    for line in f:
        # split() with no argument skips repeated spaces and the trailing newline
        for word in line.split():
            word_count[word.lower()] += 1
for word, count in word_count.items():
    print("word: {}, count: {}".format(word, count))
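The code above reads from a file named input, whereas the question says the text arrives as several lines terminated by an empty line. A rough sketch of reading such input, assuming it comes from any iterable of lines such as sys.stdin. The helper name read_until_blank is hypothetical, and io.StringIO is used here only as a stand-in for real input.

```python
import io
from collections import Counter

def read_until_blank(stream):
    """Collect lines from stream until the first empty (or whitespace-only) line."""
    lines = []
    for line in stream:
        if not line.strip():
            break
        lines.append(line.rstrip('\n'))
    return lines

# io.StringIO stands in for sys.stdin so the sketch runs without user input
fake_input = io.StringIO("Python is cool\nis it\n\nignored line\n")
lines = read_until_blank(fake_input)

word_count = Counter()
for line in lines:
    # update() adds one to the count of every word in the list
    word_count.update(line.lower().split())
print(lines)
print(word_count['is'])
```

Passing sys.stdin instead of fake_input should give the same behavior for typed or piped input.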
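Answer 2 (score: 1)
OK, so you have already identified split to turn your string into a list of words. However, you want to list the lines on which each word occurs, so you should first split the string into lines and then into words. You can then build a dictionary where the keys are the words (converted to lowercase first) and each value is a structure holding the number of occurrences and the lines on which the word occurs.
You may also want to add some code that checks whether a token is a valid word (for example, whether it contains digits) and that cleans up words (removing punctuation). I will leave those to you.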
def wsort(item):
    # sort descending by count, then ascending alphabetically
    word, freq = item
    return -freq['count'], word
def wfreq(text):
    words = {}
    # split by line, then by word
    lines = [line.split() for line in text.split('\n')]
    for i in range(len(lines)):
        for word in lines[i]:
            # if the word is not in the dictionary, create the entry
            word = word.lower()
            if word not in words:
                words[word] = {'count': 0, 'lines': set()}
            # update the count and add the line number to the set
            words[word]['count'] += 1
            words[word]['lines'].add(i+1)
    # convert from a dictionary to a sorted list using wsort to give the order
    return sorted(words.items(), key=wsort)
inp = "Python is a cool language but OCaml\nis even cooler since it is purely functional"
for word, freq in wfreq(inp):
    # generate the desired list format; sets are unordered, so sort the line numbers
    lines = " ".join(str(l) for l in sorted(freq['lines']))
    print("%i %s %s" % (freq['count'], word, lines))
This should give exactly the same output as the sample:
3 is 1 2
1 a 1
1 but 1
1 cool 1
1 cooler 2
1 even 2
1 functional 2
1 it 2
1 language 1
1 ocaml 1
1 purely 2
1 python 1
1 since 2
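The validation and cleanup left as an exercise in this answer could look roughly like the following. It is only a sketch: clean_word is a hypothetical helper, and the choices to strip surrounding punctuation and to reject any token that still contains non-letters are assumptions, not part of the original answer.

```python
import string

def clean_word(token):
    """Lowercase token and strip surrounding punctuation.

    Returns None when what remains is not purely alphabetic.
    """
    word = token.strip(string.punctuation).lower()
    return word if word.isalpha() else None

tokens = "Python, is (really) cool! v2 ...".split()
cleaned = [w for w in map(clean_word, tokens) if w is not None]
print(cleaned)
```

Inside wfreq, the word = word.lower() line could be replaced by a call to such a helper, skipping tokens for which it returns None.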
Answer 3 (score: 0)
First find all the words that occur in the text, using split().
If the text is in a file, first read it into a string, call it text, and remove all the \n from it.
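filin = open('file', 'r')
di = filin.readlines()
text = ''
for i in di:
    # replace each newline with a space so words on adjacent lines stay separate
    text += i.replace('\n', ' ')
# the distinct lowercase words, found with split() as described above
words_list = set(text.lower().split())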
Now count the occurrences of each word in the text. We will deal with the line numbers later.
dicts = {}
for i in words_list:
    dicts[i] = 0
# count whole words rather than substrings, so "is" does not match inside "this"
for w in text.lower().split():
    if w in dicts:
        dicts[w] += 1
Now we have a dictionary with the words as keys and, as values, the number of times each word appears in the text.
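Whether a word is counted by scanning for it inside the text or by comparing whole tokens matters, because a short word can occur inside a longer one. A small illustration with a made-up sentence:

```python
text = "this is an island"

# substring scan: counts every position where "is" starts, even inside other words
substring_hits = sum(1 for j in range(len(text)) if text[j:j + 2] == "is")

# whole-word count: compares against the tokens produced by split()
word_hits = text.split().count("is")

print(substring_hits, word_hits)  # prints: 3 1
```

The substring scan also finds "is" inside "this" and "island", which is why the whole-word count is the one that matches the question's definition of a word.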
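Now for the line numbers:
dicts2 = {}
for i in words_list:
    # start from an empty tuple so that line numbers can be appended
    dicts2[i] = ()
for i in words_list:
    filin.seek(0)
    count = 1
    for j in filin:
        # compare against the whole lowercase words on the line, not substrings
        if i in j.lower().split():
            dicts2[i] += (count,)
        count += 1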
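Now dicts2 has the words as keys and a tuple of line numbers as values.
If the data is already in a string, you only need to split it on \n, which also removes the newlines:
di = string_containing_text.split('\n')
Everything else is the same, except that the line-number loop iterates over di instead of filin.
I am sure you can format the output yourself.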
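For the formatting step, one possible sketch, assuming a dictionary of counts like dicts and a dictionary of line-number tuples like dicts2 as built in this answer. The sample values below are filled in by hand purely for illustration.

```python
# hand-written stand-ins for the two dictionaries built in this answer
dicts = {'is': 3, 'python': 1, 'a': 1}
dicts2 = {'is': (1, 2), 'python': (1,), 'a': (1,)}

rows = []
# highest count first, then alphabetical
for word in sorted(dicts, key=lambda w: (-dicts[w], w)):
    line_numbers = ' '.join(str(n) for n in dicts2[word])
    rows.append('{} {} {}'.format(dicts[word], word, line_numbers))

for row in rows:
    print(row)
```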