Python - Determine the frequency of strings and process them further

Time: 2013-04-27 16:56:23

Tags: python arrays text-processing

I have some text files with a variable number of columns, separated by \t (tab). Like this:

value1x1 . . . . . . value1xn
   .     . . . . . . value2xn
   .     . . . . . .     .
valuemx1 . . . . . . valuemxn

I can scan a file and determine the frequency of each value with the following code:

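f2 = open("out_freq.txt", 'w')
f = open("input_raw", 'r')
whole_content = f.read()
f.close()
list_content = whole_content.split()  # split on any whitespace (tabs, newlines)
freq = {}  # renamed from `dict` to avoid shadowing the built-in
for one_word in list_content:
    freq[one_word] = freq.get(one_word, 0) + 1
# `func` is not defined in the snippet as posted; sorting by count is assumed here
a = str(sorted(freq.items(), key=lambda item: item[1]))
f2.write(a)
f2.close()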

It produces output like this:

('26047', 13), ('42810', 13), ('61080', 13), ('106395', 13), ('102395', 13)...

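This is in the form ('value', occurrence_number), and it works as expected. What I want to achieve is:

  1. Convert the output to the form ('value', occurrence_number, column_number), where column_number is the number of the column in input_raw.txt in which the value occurs

  2. Group values that have the same occurrence count into separate columns and write these values to another file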

1 Answer:

Answer 0 (score: 0)

If I understand correctly, you want something like the following:

import itertools as it
from collections import Counter

with open("input_raw",'r') as fin, open("out_freq.txt", 'w') as fout:
    counts = Counter(it.chain.from_iterable(enumerate(line.split())
                                                  for line in fin))
    sorted_items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    a = ', '.join(str((int(key[1]), val, key[0])) for key, val in sorted_items)
    fout.write(a)

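Note that this code uses (column_index, value) tuples as keys, so equal values are counted separately when they appear in different columns. It is not clear from your question whether that can happen, or what should be done in that case.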
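To make the tuple keys concrete, here is a minimal sketch that runs on a few in-memory lines instead of a file. The sample rows and the helper name `count_by_column` are made up for illustration, and it assumes the same whitespace splitting as the code above.

```python
import itertools as it
from collections import Counter


def count_by_column(lines):
    """Count (column_index, value) pairs across an iterable of text lines."""
    # enumerate(line.split()) yields (column_index, value) tuples for one row;
    # chain.from_iterable flattens all rows into a single stream of tuples.
    pairs = it.chain.from_iterable(enumerate(line.split()) for line in lines)
    return Counter(pairs)


# Hypothetical sample rows: "10" appears in column 0 and in column 1.
sample = ["10\t11", "10\t10", "9\t11"]
counts = count_by_column(sample)

print(counts[(0, "10")])  # 2: "10" occurs twice in column 0
print(counts[(1, "10")])  # 1: the same value in column 1 is counted separately
print(counts[(1, "11")])  # 2
```

One caveat with `line.split()`: it collapses runs of whitespace, so an empty cell between two tabs would shift the column indices of the cells after it. If the files can contain empty cells, splitting with `line.rstrip('\n').split('\t')` may be the safer choice.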

Example usage (a Python 2 interactive session):

>>> import itertools as it
>>> from collections import Counter
>>> def get_sorted_items(fileobj):
...     counts = Counter(it.chain.from_iterable(enumerate(line.split()) for line in fileobj))
...     return sorted(counts.items(), key=lambda x:x[1], reverse=True)
... 
>>> data = """
... 10 11 12 13 14
... 10 9  7  6  4
... 9  8  12 13 0
... 10 21 33 6  1
... 9  9  7  13 14
... 1  21 7  13 0
... """
>>> with open('input.txt', 'wt') as fin:  #write data to the input file
...     fin.write(data)
... 
>>> with open('input.txt', 'rt') as fin:
...     print ', '.join(str((int(key[1]), val, key[0])) for key, val in get_sorted_items(fin))
... 
(13, 4, 3), (10, 3, 0), (7, 3, 2), (14, 2, 4), (6, 2, 3), (9, 2, 0), (0, 2, 4), (9, 2, 1), (21, 2, 1), (12, 2, 2), (8, 1, 1), (1, 1, 4), (1, 1, 0), (33, 1, 2), (4, 1, 4), (11, 1, 1)
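The second goal in the question (grouping values that share an occurrence count and writing them to another file) is not covered above. The following is only a sketch of one possible reading: it takes items in the shape that `get_sorted_items` returns, `((column, value), count)`, and groups them by count. The helper names `group_by_count` and `write_groups`, the `value@column` cell format, and the one-line-per-count layout are assumptions, not something the question specifies.

```python
from collections import defaultdict


def group_by_count(sorted_items):
    """Map each occurrence count to the (value, column) pairs that have it."""
    groups = defaultdict(list)
    for (column, value), count in sorted_items:
        groups[count].append((value, column))
    return dict(groups)


def write_groups(groups, path):
    """Write one tab-separated line per count: the count, then value@column cells."""
    with open(path, "w") as fout:
        for count in sorted(groups, reverse=True):
            cells = ["{}@{}".format(value, column) for value, column in groups[count]]
            fout.write("{}\t{}\n".format(count, "\t".join(cells)))


# Hypothetical items in the ((column, value), count) shape used above.
items = [((3, "13"), 4), ((0, "10"), 3), ((2, "7"), 3), ((1, "8"), 1)]
groups = group_by_count(items)

print(groups[3])  # [('10', 0), ('7', 2)]
print(groups[4])  # [('13', 3)]
```

Calling `write_groups(groups, "out_groups.txt")` (the file name is a placeholder) would then write one line per distinct count, highest count first.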