Find words in lines and count the lines that contain them

Time: 2014-03-22 16:14:07

Tags: python

In the file:

aaa 012 345
abc deg hij
hij aaa 075
aaa 345 658

I tried:

filer = file.read().split('\n')
count = 0
for line in filer:
    lines = line.split(' ')
    for words in lines:
        #print words, lines.count(words)
        if words in set(lines):
            count = count + 1
            print words, ', count line: ', count

The result shows:

aaa , count line:  1
012 , count line:  2
345 , count line:  3
abc , count line:  4
deg , count line:  5
hij , count line:  6
hij , count line:  7
aaa , count line:  8
075 , count line:  9
aaa , count line:  10
345 , count line:  11
658 , count line:  12

I want to count and print, for each word, the total number of lines that contain it. (Sorry for my explanation.)

Expected result:

aaa , count line: 3
012 , count line: 1
345 , count line: 2

abc , count line: 1
deg , count line: 1
hij , count line: 2

hij , count line: 2
aaa , count line: 3
075 , count line: 1

aaa , count line: 3
345 , count line: 2
658 , count line: 1

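Is there any suggestion for printing the expected result in the original line order?

I need these counts to calculate the "frequency of a word in terms of line frequency".

For example: the frequency of 'aaa' would be calculated as the total number of lines divided by the number of lines that contain the word 'aaa'.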
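The calculation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `line_frequency_ratio` is hypothetical, blank lines are assumed not to count toward the total, and the sample data is inlined instead of being read from the file. The `print` calls take a single formatted string, so they should behave the same on Python 2 and Python 3.

```python
from collections import Counter

def line_frequency_ratio(lines):
    """Map each word to (total lines) / (lines containing the word).

    Hypothetical helper; assumes blank lines do not count as lines.
    """
    containing = Counter()  # word -> number of lines that contain it
    total = 0
    for line in lines:
        words = line.split()
        if not words:
            continue  # assumption: skip blank lines
        total += 1
        # set() ensures a word is counted at most once per line
        containing.update(set(words))
    return {word: float(total) / n for word, n in containing.items()}

# Sample data copied from the question, inlined for illustration
sample = ["aaa 012 345", "abc deg hij", "hij aaa 075", "aaa 345 658"]
ratios = line_frequency_ratio(sample)
print("aaa: %.4f" % ratios["aaa"])  # 4 lines / 3 lines containing 'aaa'
print("hij: %.1f" % ratios["hij"])  # 4 lines / 2 lines containing 'hij'
```

With the four sample lines, 'aaa' appears in three of them, so its ratio is 4/3, while a word that appears in a single line gets a ratio of 4.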

2 answers:

Answer 0: (score: 2)

collections.Counter is made for this purpose:

from collections import Counter

counter = Counter()

with open('data.txt') as data:
    for line in data:
        counter.update(line.split())

for item, count in counter.items():
    print "%s , count: %s" % (item, count)

Output:

abc , count: 1
aaa , count: 3
345 , count: 2
012 , count: 1
075 , count: 1
hij , count: 2
658 , count: 1
deg , count: 1

Edit: I'm still not clear on the final result you're looking for, but this produces the exact output you asked for:

from collections import Counter

line_frequencies = Counter()

with open('data.txt') as data:
    lines = [line.split() for line in data]

for line in lines:
    unique_line = set(line)
    line_frequencies.update(unique_line)


for line in lines:
    for term in line:
        print "%s , count line: %s" % (term, line_frequencies[term])
    print  # Python 2: a bare print emits a single blank line between groups
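The two snippets in this answer agree on the sample file only because no word repeats within a line. The sketch below, which uses made-up data rather than anything from the question, shows where they would diverge: updating the counter with the raw word list counts occurrences, while updating it with `set(words)` counts lines that contain the word.

```python
from collections import Counter

# Hypothetical data: 'aaa' appears twice on the first line
lines = ["aaa aaa bbb", "aaa ccc"]

occurrences = Counter()  # how many times each word appears overall
line_counts = Counter()  # how many lines contain each word

for line in lines:
    words = line.split()
    occurrences.update(words)       # every occurrence is counted
    line_counts.update(set(words))  # at most one count per line

print("occurrences of aaa: %d" % occurrences["aaa"])  # 3
print("lines with aaa: %d" % line_counts["aaa"])      # 2
```

For the line-based frequency described in the question, the `set` variant is the one that matches.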

Answer 1: (score: 1)

You need to associate a count with each token. I suggest you try something like:

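tokenCount = {}

# 'with' closes the file automatically; the mode must be the string "r"
with open("this.txt", "r") as infile:
    for line in infile:
        # split() with no argument also drops the trailing newline
        for token in line.split():
            if token in tokenCount:
                tokenCount[token] += 1
            else:
                tokenCount[token] = 1

# Note: this counts every occurrence of a token, not the lines containing it
for item in tokenCount:
    print item, ' , count line: ', tokenCount[item]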

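Your expected output seems a bit unnecessary. It appears to know how many times a token occurs before the file has been read that far, and I don't see any need for that.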