Question

Answer 1

I just tried this on a file with 4.3M lines, and the fastest option was the 'shuf' command on Linux. Use it like this:

shuf huge_file.txt -o shuffled_lines_huge_file.txt

It took 2-3 seconds to finish.
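If shuf is not available (for example outside Linux), roughly the same in-memory shuffle can be sketched in Python. The function name shuffle_file and its seed parameter are made up for this illustration, and the sketch assumes the whole file fits in RAM:

```python
import random

def shuffle_file(src_path, dst_path, seed=None):
    """Read every line of src_path, shuffle in memory, write to dst_path."""
    rng = random.Random(seed)  # seed is optional, only useful for reproducible runs
    with open(src_path, "r", encoding="utf-8") as src:
        lines = src.readlines()
    # Ensure the last line ends with a newline so it cannot merge with another line
    if lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"
    rng.shuffle(lines)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
```

Calling shuffle_file("huge_file.txt", "shuffled_lines_huge_file.txt") would mirror the command above, though it will likely be slower than shuf.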

Answer 2

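Here is another way using random.choice. It may free memory gradually as lines are written out, but it has a worse Big-O, since each lines.remove() call scans the list :)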

from random import choice

with open('data.txt', 'r') as r:
    lines = r.readlines()

with open('shuffled_data.txt', 'w') as w:
    while lines:
        l = choice(lines)
        lines.remove(l)
        w.write(l)
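The quadratic cost comes from lines.remove(l). A possible variation (a sketch, not the answer's original code) picks a random index, swaps that line to the end, and pops it, so each removal is O(1) while the list still shrinks as lines are written. The helper name write_shuffled is hypothetical:

```python
import random

def write_shuffled(lines, out):
    """Write lines to out in random order, emptying the list as it goes."""
    while lines:
        i = random.randrange(len(lines))
        # Move the chosen line to the end so that pop() is O(1)
        lines[i], lines[-1] = lines[-1], lines[i]
        out.write(lines.pop())
```

It could be used as write_shuffled(lines, w) in place of the while loop above.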

Answer 3

The following Vimscript can be used to swap lines:

function! Random()                                                       
  let nswaps = 100                                                       
  let firstline = 1                                                     
  let lastline = 10                                                      
  let i = 0                                                              
  while i <= nswaps                                                      
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe line.'d'                                                         
    exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
    exe "normal! " . line . 'Gp'                                         
    let i += 1                                                           
  endwhile                                                               
endfunction

Select the function in visual mode and type :@", then run it with :call Random()

Answer 4

This will do the trick: my solution doesn't even use a random method, and it also removes duplicates.

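import sys
# set() drops duplicates; its iteration order is arbitrary, not a uniform shuffle
with open(sys.argv[1]) as f:
    lines = list(set(f.readlines()))
# lines keep their trailing newline, so join with '' rather than ' '
print(''.join(lines), end='')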

In the shell:

python shuffler.py nameoffilestobeshuffled.txt > shuffled.txt
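Because a set's iteration order is arbitrary and not guaranteed to be a uniform random permutation, a variation that deduplicates and then shuffles explicitly might look like the sketch below. The helper name unique_shuffled is hypothetical and not part of the answer:

```python
import random

def unique_shuffled(lines):
    """Return the distinct lines in random order."""
    unique = list(dict.fromkeys(lines))  # drops duplicates, keeps first occurrences
    random.shuffle(unique)               # explicit shuffle instead of relying on set order
    return unique
```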

Answer 5

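This is not necessarily the solution to your problem. I'm leaving it here for people who arrive looking for a way to shuffle much larger files, although it works for smaller files too. Change split -b 1GB to a smaller size, e.g. split -b 100MB, to produce many text files of 100MB each.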

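I have a 20GB file containing more than 1.5 billion sentences. Calling the shuf command in a Linux terminal simply overwhelmed my 16GB of RAM and the same amount of swap. Here is the bash script I wrote to get the job done. It assumes the bash script is saved in the same folder as the large text file.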

#!/bin/bash

#Create a temporary folder named "splitted" 
mkdir ./splitted


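#Split input file into multiple small (1GB each) files
#This will help us shuffle the data
#Note: -b cuts at exact byte offsets and may split a line in two; GNU split's -C 1GB keeps lines whole
echo "Splitting big txt file..."
split -b 1GB ./your_big_file.txt ./splitted/file --additional-suffix=.txt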
echo "Done."

#Shuffle the small files
echo "Shuffling splitted txt files..."
for entry in "./splitted"/*.txt
do
  shuf "$entry" -o "$entry"
done
echo "Done."

#Concatenate the splitted shuffled files into one big text file
echo "Concatenating shuffled txt files into 1 file..."
cat ./splitted/* > ./your_big_file_shuffled.txt
echo "Done"

#Delete the temporary "splitted" folder
rm -rf ./splitted
echo "Complete."
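Note that the script above only mixes lines within each 1GB piece, so lines from the beginning of the input stay near the beginning of the output. One possible way to get a shuffle across the whole file while keeping memory bounded is to scatter lines into random temporary buckets first and then shuffle each bucket. The following is a sketch under assumptions, not the answer's method: the function name shuffle_large_file and the num_buckets value are invented here, and each bucket is assumed to fit in RAM.

```python
import os
import random
import tempfile

def shuffle_large_file(src_path, dst_path, num_buckets=64):
    """Two-pass shuffle: scatter lines into random buckets, then shuffle each bucket."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.txt") for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8") for p in paths]
        try:
            with open(src_path, "r", encoding="utf-8") as src:
                for line in src:
                    if not line.endswith("\n"):
                        line += "\n"  # keep a final unterminated line separate
                    # Pass 1: each line goes to a bucket chosen uniformly at random
                    random.choice(buckets).write(line)
        finally:
            for b in buckets:
                b.close()
        with open(dst_path, "w", encoding="utf-8") as dst:
            for p in paths:
                # Pass 2: only one bucket is held in memory at a time
                with open(p, "r", encoding="utf-8") as b:
                    chunk = b.readlines()
                random.shuffle(chunk)
                dst.writelines(chunk)
```

For a 20GB input, num_buckets would need to be large enough that a single bucket (about the input size divided by num_buckets) fits comfortably in memory.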
