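Shuffle a file by groups of lines

Time: 2018-01-21 01:42:38

Tags: python shell unix random shuffle

I have been looking for a way, in Python or with Unix commands, to shuffle a large text dataset while keeping lines grouped by the value of their first word, as shown below.

Input text: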

"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32

So the file should be shuffled randomly, but with each group kept together, as follows.

Sample output:

"QQQ" , 43, 54, 35  
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5

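I found solutions for a plain shuffle, but I can't figure out how to keep the groups together while shuffling.

2 answers:

Answer 0: (score: 3)

This can be done with collections.defaultdict. By keying each line on its first field, you can easily group the lines, and then you only need to sample the dictionary's keys in random order, like this: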

import random
from collections import defaultdict

# Read all the lines from the file
lines = defaultdict(list)
with open("/path/to/file", "r") as in_file:
    for line in in_file:
        # Group by the first field; strip it so '"QQQ" ' and '"QQQ"' share a key
        key = line.split(",", 1)[0].strip()
        lines[key].append(line)

# Randomize the order of the groups
# (random.sample() needs a sequence, so convert the keys to a list first)
rnd_keys = random.sample(list(lines), len(lines))

# Write the result back (note: this overwrites the input file)
with open("/path/to/file", "w") as out_file:
    for k in rnd_keys:
        for line in lines[k]:
            out_file.write(line)

Hope this helps.
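
The same idea can be wrapped in a small function that works on any iterable of lines, which makes it easy to try on the sample data without touching a file. This is only a sketch: the name `shuffle_by_group` and its optional `rng` parameter are made up for illustration, and it assumes the group key is everything before the first comma, with surrounding whitespace ignored.

```python
import random
from collections import defaultdict


def shuffle_by_group(lines, rng=random):
    """Return the lines with the groups in random order.

    Hypothetical helper: the group key is the text before the first
    comma, stripped. Lines inside a group keep their original order.
    """
    groups = defaultdict(list)
    for line in lines:
        key = line.split(",", 1)[0].strip()
        groups[key].append(line)

    keys = list(groups)  # one entry per distinct key
    rng.shuffle(keys)    # shuffle the groups, not the individual lines
    return [line for key in keys for line in groups[key]]


sample = [
    '"ABC", 21, 15, 45\n',
    '"DEF", 35, 3, 35\n',
    '"DEF", 124, 33, 5\n',
    '"QQQ" , 43, 54, 35\n',
    '"XZZ", 43, 35 , 32\n',
    '"XZZ", 45 , 35, 32\n',
]
shuffled = shuffle_by_group(sample)
print("".join(shuffled), end="")
```

Because the lines are collected per key before anything is shuffled, this variant does not require lines of the same group to be adjacent in the input, but it does hold the whole dataset in memory.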

Answer 1: (score: 2)

You can also store each line of the file in a nested list:

lines = []
with open('input_text.txt') as in_file:
    for line in in_file:
        line = [x.strip() for x in line.strip().split(',')]
        lines.append(line)

Which gives:

[['"ABC"', '21', '15', '45'], ['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5'], ['"QQQ"', '43', '54', '35'], ['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]

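You can then group these lists by their first item with itertools.groupby(). Note that groupby() only groups consecutive items, so this relies on lines with the same first item already being adjacent in the file, as they are here: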

import itertools
from operator import itemgetter

grouped = [list(g) for _, g in itertools.groupby(lines, key = itemgetter(0))]

Which gives a list of your grouped items:

[[['"ABC"', '21', '15', '45']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']], [['"QQQ"', '43', '54', '35']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]]
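
One caveat worth checking on real data: itertools.groupby() only merges consecutive items that share a key. A minimal sketch of the difference, using made-up rows in which the two "DEF" rows are not adjacent; sorting by the key first is one possible workaround, assuming it is acceptable to reorder the input before shuffling:

```python
import itertools
from operator import itemgetter

# Made-up rows: the two "DEF" rows are separated by an "ABC" row
rows = [['"DEF"', '1'], ['"ABC"', '2'], ['"DEF"', '3']]

# Without sorting, "DEF" ends up in two separate groups
unsorted_groups = [list(g) for _, g in itertools.groupby(rows, key=itemgetter(0))]
print(len(unsorted_groups))  # 3

# Sorting by the key first brings equal keys together; sorted() is stable,
# so rows keep their relative order within each group
sorted_rows = sorted(rows, key=itemgetter(0))
sorted_groups = [list(g) for _, g in itertools.groupby(sorted_rows, key=itemgetter(0))]
print(len(sorted_groups))  # 2
```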

Then you can shuffle it with random.shuffle(), which reorders the list in place:

import random

random.shuffle(grouped)

Which gives a randomly ordered list with the groups still intact (the exact order differs from run to run):

[[['"QQQ"', '43', '54', '35']], [['"ABC"', '21', '15', '45']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']]]
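
If the same shuffled order has to be reproduced later, one option is to shuffle with a dedicated, seeded random.Random instance instead of the module-level function. The sketch below uses a shortened stand-in for the grouped list, and the seed value 0 is an arbitrary choice for illustration:

```python
import random

# Shortened stand-in for the grouped list: only the first field of each row
grouped = [[['"ABC"']], [['"DEF"'], ['"DEF"']], [['"QQQ"']], [['"XZZ"'], ['"XZZ"']]]

first = list(grouped)             # shallow copy, so the original order is kept
random.Random(0).shuffle(first)   # seed 0 is an arbitrary choice

second = list(grouped)
random.Random(0).shuffle(second)  # same seed, same resulting order

print(first == second)  # True
```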

Now all you have to do is flatten the final list and write it to a new file, which you can do with itertools.chain.from_iterable():

with open('output_text.txt', 'w') as out_file:
    for line in itertools.chain.from_iterable(grouped):
        out_file.write(', '.join(line) + '\n')

# Read the result back to check it, closing the file afterwards
with open('output_text.txt') as result_file:
    print(result_file.read())

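Which gives you a new, shuffled version of your file (the whitespace around each field is normalized, because the fields were stripped when the file was read):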

"QQQ", 43, 54, 35
"ABC", 21, 15, 45
"XZZ", 43, 35, 32
"XZZ", 45, 35, 32
"DEF", 35, 3, 35
"DEF", 124, 33, 5