Question

I have some text files with a variable number of columns, separated by \t (tab). Like this:

value1x1 . . . . . . value1xn
   .     . . . . . . value2xn
   .     . . . . . .     .
valuemx1 . . . . . . valuemxn

I can scan the file and determine the frequency of each value with the following code:

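with open("input_raw", 'r') as f:
    whole_content = f.read()
list_content = whole_content.split()  # splits on any whitespace, tabs included
freq = {}  # renamed from dict to avoid shadowing the built-in
for one_word in list_content:
    freq[one_word] = freq.get(one_word, 0) + 1
# func is not defined in the post; sorting by count is assumed here
a = str(sorted(freq.items(), key=lambda item: item[1]))
with open("out_freq.txt", 'w') as f2:
    f2.write(a)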

The output looks like this:

('26047', 13), ('42810', 13), ('61080', 13), ('106395', 13), ('102395', 13)...

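This is in the form ('value', occurrence_number), and it works as expected. What I want to achieve is:

Convert the output to the form ('value', occurrence_number, column_number), where column_number is the column of input_raw.txt in which the value occurs
Group the values that have the same occurrence count into separate columns and write those values to another file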

Answer 1

If I understand correctly, you want something like the following:

import itertools as it
from collections import Counter

with open("input_raw",'r') as fin, open("out_freq.txt", 'w') as fout:
    counts = Counter(it.chain.from_iterable(enumerate(line.split())
                                                  for line in fin))
    sorted_items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    a = ', '.join(str((int(key[1]), val, key[0])) for key, val in sorted_items)
    fout.write(a)

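Note that this code uses (column, value) tuples as keys, so equal values are counted separately when they appear in different columns. It is not clear from your question whether that can happen, or what should be done in that case.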
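
If equal values in different columns should be counted together instead, one option is to key on the value alone and collect the column numbers separately. This is only a minimal sketch under that assumption; `count_with_columns` and its `sep` parameter are names made up for this illustration, and it splits on the tab character explicitly so that an empty field does not shift the column numbers.

```python
from collections import Counter, defaultdict

def count_with_columns(lines, sep="\t"):
    """Return (value, total_count, columns) tuples, most frequent first.

    Hypothetical helper: counts each value across all columns and
    records the 0-based columns in which it was seen.
    """
    totals = Counter()
    columns = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        for col, value in enumerate(line.split(sep)):
            if not value:
                continue  # empty field: keep the column index, count nothing
            totals[value] += 1
            columns[value].add(col)
    return [(value, count, sorted(columns[value]))
            for value, count in totals.most_common()]
```

Passing an open file object as `lines` works, since a file iterates over its lines.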

Example usage:

>>> import itertools as it
>>> from collections import Counter
>>> def get_sorted_items(fileobj):
...     counts = Counter(it.chain.from_iterable(enumerate(line.split()) for line in fileobj))
...     return sorted(counts.items(), key=lambda x:x[1], reverse=True)
... 
>>> data = """
... 10 11 12 13 14
... 10 9  7  6  4
... 9  8  12 13 0
... 10 21 33 6  1
... 9  9  7  13 14
... 1  21 7  13 0
... """
>>> with open('input.txt', 'wt') as fin:  #write data to the input file
...     fin.write(data)
... 
>>> with open('input.txt', 'rt') as fin:
...     print ', '.join(str((int(key[1]), val, key[0])) for key, val in get_sorted_items(fin))
... 
(13, 4, 3), (10, 3, 0), (7, 3, 2), (14, 2, 4), (6, 2, 3), (9, 2, 0), (0, 2, 4), (9, 2, 1), (21, 2, 1), (12, 2, 2), (8, 1, 1), (1, 1, 4), (1, 1, 0), (33, 1, 2), (4, 1, 4), (11, 1, 1)
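
For the second part of the question (grouping values that share an occurrence count), a possible sketch follows. It reads "write to another file" as one output file per count, which is an assumption about what is wanted; `group_by_count`, `write_groups` and the `out_count_N.txt` name pattern are hypothetical and not part of the code above.

```python
import os
from collections import Counter, defaultdict

def group_by_count(lines):
    """Map each occurrence count to the (value, column) pairs that have it."""
    # Same (column, value) keys as in the answer's Counter.
    counts = Counter((col, value)
                     for line in lines
                     for col, value in enumerate(line.split()))
    groups = defaultdict(list)
    for (col, value), n in counts.items():
        groups[n].append((value, col))
    return groups

def write_groups(groups, out_dir="."):
    """Write one tab-separated file per count (placeholder file names)."""
    for n, pairs in groups.items():
        path = os.path.join(out_dir, "out_count_%d.txt" % n)
        with open(path, "w") as fout:
            for value, col in sorted(pairs):
                fout.write("%s\t%d\n" % (value, col))
```

With the sample data above, `group_by_count` puts `('13', 3)` alone under count 4, and `('10', 0)` and `('7', 2)` under count 3.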
