I want to iterate over the contents of text files; while analysing a file's contents, the code determines which new file must be written to (and what its contents should be).
I have a working solution (see the code below), but I don't know whether it is the best approach. Specifically, I want the files to be closed automatically even if an unexpected error occurs. I have tried to handle this case, but see the code comment:
What if something goes wrong here, we could still have files in open state right?
As far as I know, closing a file prevents file corruption. Is that correct? What other implications does closing a file have?
If I can make sure the files are not corrupted, then I can write code that picks up where it left off (perhaps with some manual adjustments, depending on what is in the debug log) instead of starting over from scratch.
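As a minimal sketch (not part of the question's code) of the guarantee being asked about: the do-block form of open() always closes the file, even when the body throws, so the stream is not leaked and its buffered bytes are released. The Ref is only there so the stream can be inspected after the error.

```julia
# Sketch: open() with a do-block closes the file even when the body errors.
path = tempname()
io_ref = Ref{Any}(nothing)   # holds the stream so we can inspect it afterwards
try
    open(path, "w") do f
        io_ref[] = f
        write(f, "some data\n")
        error("simulated failure mid-write")
    end
catch err
    @assert err isa ErrorException
end
@assert !isopen(io_ref[])   # the stream was closed despite the error
```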
Working solution (see jupyter-notebook instead):
Question dependencies
# Ensure an empty directory for the execution of this question's code
tmp_dir = "/tmp/stackoverflow-question-55012211"
rm(tmp_dir, force=true, recursive=true)
mkdir(tmp_dir)
# Write example ".fakeq" files.
# In my real life problem, they would be ".fastq" (see https://en.wikipedia.org/wiki/FASTQ_format)
# and sample would not be known at this stage, simplifying to keep things relevant to question
open("$(tmp_dir)/pool1.fakeq", "w") do f
    write(f, "id1_sample1_ACGTA\n")
    write(f, "id2_sample3_CGTACG\n")
    write(f, "id3_sample2_GTACTAC\n")
    write(f, "id4_sample1_TACGGTAC\n")
    write(f, "id5_sample2_ACGTGTACG\n")
    write(f, "id6_sample3_CGTATACGTA\n")
    write(f, "id7_sample2_GTACCGTAC\n")
    write(f, "id8_sample1_TACGGTAC\n")
    write(f, "id9_sample1_ACGTGTA\n")
end
open("$(tmp_dir)/pool2.fakeq", "w") do f
    write(f, "id10_sample2_ACGTAACGTA\n")
    write(f, "id11_sample1_CGTACGCGTACG\n")
    write(f, "id12_sample3_GTACTACGTACTAC\n")
    write(f, "id13_sample2_TACGGTACTACGGTAC\n")
    write(f, "id14_sample1_ACGTGTACGACGTGTACG\n")
    write(f, "id15_sample3_CGTATACGTACGTATACGTA\n")
    write(f, "id16_sample2_GTACCGTACGTACCGTAC\n")
    write(f, "id17_sample1_TACGGTACTACGGTAC\n")
    write(f, "id18_sample1_ACGTGTAACGTGTA\n")
end
# This array can be in the order of 10 - 20 elements long
csv_header = [
    "identifier",
    "sample_name",
    "sequence",
    "sequence_length"
]
# This array can be in the order of 25 - 50 elements long.
# In real-life problem, we know this list of samples up front
# and sample_name is calculated by matching an array of nucleotide
# 'barcode' sequences up against each sequence in the .fastq files
sample_names = [
    "sample1",
    "sample2",
    "sample3"
]
# This array can be in the order of 4 - 12 elements long
# In real-life problem, we know this list of pools up front and each
# pool corresponds to a .fastq file mentioned above
pool_list = [
    "pool1",
    "pool2"
]
# I am creating a mapping here so that a file is written in a location
# dependent on the sample name
# What if something goes wrong here, we could still have files in open state right?
# If inside the try block below, then potentially some files will be attempted to be
# closed before being opened
sample_csv_mapping = Dict(
    sample_name => open("$(tmp_dir)/$(sample_name).csv", "w")
    for sample_name in sample_names
)
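One way to address the "what if something goes wrong here" comment above is to build the mapping inside its own try block and, if any open() call throws, close whatever streams were already opened before rethrowing. This is a self-contained sketch (tmp_dir and sample_names stand in for the values defined earlier), not the question's code:

```julia
# Sketch: avoid leaking streams if open() fails partway through building the mapping.
tmp_dir = mktempdir()
sample_names = ["sample1", "sample2", "sample3"]
sample_csv_mapping = Dict{String,IOStream}()
try
    for sample_name in sample_names
        sample_csv_mapping[sample_name] = open("$(tmp_dir)/$(sample_name).csv", "w")
    end
catch
    foreach(close, values(sample_csv_mapping))  # close whatever did get opened
    rethrow()
end
@assert all(isopen, values(sample_csv_mapping))
foreach(close, values(sample_csv_mapping))
```

Because the Dict only ever contains successfully opened streams, the cleanup loop never tries to close a file that was not opened.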
Main block
# An attempt to ensure that files are closed in case of error
try
    # Initialises (overwrites) each csv with the header
    for (sample, csv_stream) in sample_csv_mapping
        write(csv_stream, join(csv_header, ","), "\n")
    end
    for pool in pool_list
        # This automatically handles closing the file upon error
        open("$(tmp_dir)/$(pool).fakeq", "r") do f
            lines = readlines(f)
            for line in lines
                identifier, sample_name, sequence = split(line, "_")
                sequence_length = length(sequence)
                csv_row = [
                    identifier,
                    sample_name,
                    sequence,
                    sequence_length
                ]
                write(sample_csv_mapping[sample_name], join(csv_row, ","), "\n")
            end
        end
    end
finally
    println("Manually handle closing files whether upon successful run or upon error")
    for (sample, csv_stream) in sample_csv_mapping
        close(csv_stream)
    end
end
Answer 0 (score: 1)
In your code, it seems the best option is to flush the stream after each chunk of data is written to it. This forces the buffered bytes to be written to disk, so data loss can be avoided:
flush(f)
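A small self-contained sketch of what flush() buys you: after flushing, the bytes written so far are visible to other readers of the file even though the stream is still open.

```julia
# Sketch: flush() pushes buffered bytes out while the stream stays open.
path = tempname()
f = open(path, "w")
write(f, "header\n")
flush(f)   # force the buffered bytes out to the file
@assert read(path, String) == "header\n"   # visible to other readers now
@assert isopen(f)                          # and the stream is still usable
close(f)
```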
因为您要求帮助编辑代码:
sample_names = Symbol.([
    "sample1",
    "sample2",
    "sample3"
])
last_sample = :none
open("$(tmp_dir)/$(pool).fakeq", "r") do f
    lines = readlines(f)
    for line in lines
        identifier, sample_name, sequence = split(line, "_")
        # Convert to a Symbol so lookups match the Symbol keys above
        sample_name = Symbol(sample_name)
        sequence_length = length(sequence)
        csv_row = [
            identifier,
            sample_name,
            sequence,
            sequence_length
        ]
        # When the sample changes, flush the previous sample's stream so
        # its rows reach the disk before anything else can go wrong
        if last_sample != :none && last_sample != sample_name
            flush(sample_csv_mapping[last_sample])
        end
        last_sample = sample_name
        write(sample_csv_mapping[sample_name], join(csv_row, ","), "\n")
    end
end