Is it possible to split a file? For example, I have a huge wordlist and I want to split it into multiple files. How can this be done?
Answer 0 (score: 17)
This splits the file on newlines and writes the pieces back out. You can easily change the delimiter. It also handles an uneven remainder if the input file does not contain a multiple of splitLen lines (20 in this example).
splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
Answer 1 (score: 15)
A better loop for sli's example, which does not hog memory:
splitLen = 20         # 20 lines per file
outputBase = 'output' # output0.txt, output1.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1

if dest: dest.close()
input.close()
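The same streaming idea can be written with context managers and itertools.islice so the files are always closed; this is a sketch, and the function name split_by_lines is made up for illustration:

```python
import itertools

def split_by_lines(path, lines_per_file, output_base):
    """Split `path` into files of at most `lines_per_file` lines each.

    Same streaming approach as the loop above: never holds more than
    one chunk of lines in memory at a time.
    """
    part = 0
    with open(path) as src:
        while True:
            # Pull at most `lines_per_file` lines from the open file handle.
            chunk = list(itertools.islice(src, lines_per_file))
            if not chunk:
                break
            with open(f"{output_base}{part}.txt", "w") as dest:
                dest.writelines(chunk)
            part += 1
    return part  # number of files written
```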
Answer 2 (score: 7)
A solution to split binary files into chapters .000, .001, etc.:
FILE = 'scons-conversion.7z'

MAX = 500*1024*1024       # 500Mb - max chapter size
BUF = 50*1024*1024*1024   # 50GB  - memory buffer size

chapters = 0
uglybuf = b''
with open(FILE, 'rb') as src:
    while True:
        tgt = open(FILE + '.%03d' % chapters, 'wb')
        written = 0
        while written < MAX:
            if len(uglybuf) > 0:
                tgt.write(uglybuf)
            tgt.write(src.read(min(BUF, MAX - written)))
            written += min(BUF, MAX - written)
            uglybuf = src.read(1)
            if len(uglybuf) == 0:
                break
        tgt.close()
        if len(uglybuf) == 0:
            break
        chapters += 1
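The one-byte lookahead (uglybuf) is only there to detect EOF; the same chaptering can be done more simply by reading fixed-size chunks until read() returns empty. A sketch, with the function name split_binary invented for illustration (note it holds one full chunk in memory at a time):

```python
def split_binary(path, chunk_size):
    """Write path.000, path.001, ... each at most chunk_size bytes."""
    part = 0
    with open(path, "rb") as src:
        while True:
            data = src.read(chunk_size)  # empty bytes object signals EOF
            if not data:
                break
            with open("%s.%03d" % (path, part), "wb") as tgt:
                tgt.write(data)
            part += 1
    return part  # number of chapters written
```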
Answer 3 (score: 2)
Sure it's possible:
open input file
open output file 1
count = 0
for each line in file:
    write to output file
    count = count + 1
    if count == maxlines:
        close output file
        open next output file
        count = 0
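A minimal runnable translation of that pseudocode; the names split_lines, output_pattern, and maxlines are placeholders, not from the original answer:

```python
def split_lines(input_path, output_pattern, maxlines):
    """Line-for-line Python version of the pseudocode above.

    output_pattern is a format string like "part{}.txt".
    Returns the number of the last part written.
    """
    part = 1
    count = 0
    out = open(output_pattern.format(part), "w")
    with open(input_path) as src:
        for line in src:
            out.write(line)
            count += 1
            if count == maxlines:
                # Close the current output file and open the next one.
                out.close()
                part += 1
                out = open(output_pattern.format(part), "w")
                count = 0
    out.close()
    return part
```

One caveat of this direct translation: if the line count is an exact multiple of maxlines, the last file opened is left empty.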
Answer 4 (score: 2)
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'r+b') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'w+b') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += len(data)  # count bytes actually read
                    else:
                        return suffix
            suffix += 1
def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'w+b') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'r+b') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
Answer 5 (score: 1)
import re

PATENTS = 'patent.data'

def split_file(filename):
    # Open the file for reading
    with open(filename, "r") as r:
        # Counter
        n = 0
        # Start reading the file line by line
        for i, line in enumerate(r):
            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'\<\?xml', line):
                n += 1
                # This "if" can be deleted; without it naming starts from 1.
                # Keeping it or not depends on where "re" first finds the
                # template. In my case it was the first line.
                if i == 0:
                    n = 0
            # Append the line to the current output file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)
As a result you will get:
patent.data-0
patent.data-1
patent.data-N
Answer 6 (score: 1)
You can use the filesplit module from PyPI.
Answer 7 (score: 0)
This is a late answer, but a new question was linked here and none of the existing answers mentioned itertools.groupby.

Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines file_part1.txt, ..., file_partn.txt, you can do:
import itertools

MAXLINES = 3

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin),
                                    lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
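The grouping key is just the zero-based line index integer-divided into blocks, so consecutive lines with the same key land in the same part. A small self-contained illustration of how the key behaves:

```python
import itertools

# Five lines split into blocks of MAXLINES = 2.
lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
MAXLINES = 2

# Each (index, line) pair maps to key 1 + index // MAXLINES.
groups = [(key, [line for _, line in grp])
          for key, grp in itertools.groupby(
              enumerate(lines), lambda x: 1 + x[0] // MAXLINES)]
# groups -> [(1, ['a\n', 'b\n']), (2, ['c\n', 'd\n']), (3, ['e\n'])]
```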
Answer 8 (score: 0)
All the answers provided are good and (probably) work. However, they load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).

I found the following to be the most efficient way:
import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    print(os.system(f"ls {PREFIX}??"))
else:
    print("Failed!")
Read more about split at https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
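If you shell out like this, subprocess.run with an argument list is generally safer than os.system, since it avoids shell quoting issues with odd filenames and raises on failure. A sketch assuming GNU/BSD split is on PATH (split_with_coreutils is a made-up helper name; -d numeric suffixes are a GNU extension):

```python
import shutil
import subprocess

def split_with_coreutils(path, lines, prefix):
    """Split `path` into `lines`-line pieces named prefix00, prefix01, ...

    Delegates the work to the coreutils `split` command; raises
    CalledProcessError if it exits non-zero.
    """
    if shutil.which("split") is None:
        raise FileNotFoundError("`split` not found on PATH")
    # -l: lines per output file; -d: numeric suffixes (GNU extension)
    subprocess.run(["split", "-l", str(lines), "-d", path, prefix],
                   check=True)
```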