我一直在寻找Python / Unix Command中的一些方法,通过基于第一个单词值分组来调整大型文本数据集,如下所示 -
输入文字:
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
所以它会随机改组,但保持小组在一起如下
输出样本 -
"QQQ" , 43, 54, 35
"XZZ", 43, 35 , 32
"XZZ", 45 , 35, 32
"ABC", 21, 15, 45
"DEF", 35, 3, 35
"DEF", 124, 33, 5
我通过正常洗牌找到了解决方案,但是我没有想到在洗牌时保持小组。
答案 0 :(得分:3)
可以使用collections.defaultdict来完成。通过按照第一个序列识别每一行,您可以轻松地对它们进行排序,然后只对字典的键进行采样,如下所示:
import random
from collections import defaultdict
# Read all the lines from the file
lines = defaultdict(list)
with open("/path/to/file", "r") as in_file:
for line in in_file:
s_line = line.split(",")
lines[s_line[0]].append(line)
# Randomize the order
rnd_keys = random.sample(lines.keys(), len(lines))
# Write back to the file?
with open("/path/to/file", "w") as out_file:
for k in rnd_keys:
for line in lines[k]:
out_file.write(line)
希望这有助于你的努力。
答案 1 :(得分:2)
您还可以将文件中的每一行存储到嵌套列表中:
lines = []
with open('input_text.txt') as in_file:
for line in in_file.readlines():
line = [x.strip() for x in line.strip().split(',')]
lines.append(line)
给出了:
[['"ABC"', '21', '15', '45'], ['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5'], ['"QQQ"', '43', '54', '35'], ['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]
然后,您可以使用itertools.groupby()
第一项对这些列表进行分组:
import itertools
from operator import itemgetter
grouped = [list(g) for _, g in itertools.groupby(lines, key = itemgetter(0))]
其中列出了您的分组项目:
[[['"ABC"', '21', '15', '45']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']], [['"QQQ"', '43', '54', '35']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']]]
然后你可以用random.shuffle()
来填充它:
import random
random.shuffle(grouped)
其中包含完整的分组项目的随机列表:
[[['"QQQ"', '43', '54', '35']], [['"ABC"', '21', '15', '45']], [['"XZZ"', '43', '35', '32'], ['"XZZ"', '45', '35', '32']], [['"DEF"', '35', '3', '35'], ['"DEF"', '124', '33', '5']]]
现在你要做的就是压扁最终列表并将其写入一个新文件,你可以使用itertools.chain.from_iterable()
:
with open('output_text.txt', 'w') as out_file:
for line in itertools.chain.from_iterable(grouped):
out_file.write(', '.join(line) + '\n')
print(open('output_text.txt').read())
哪个a为您的文件提供了新的随机播放版本:
"QQQ", 43, 54, 35
"ABC", 21, 15, 45
"XZZ", 43, 35, 32
"XZZ", 45, 35, 32
"DEF", 35, 3, 35
"DEF", 124, 33, 5