I am working with two files that contain millions of records. I am sharing test data here only to illustrate the problem. For example, tx_match.txt contains all the records, while txid_time.txt contains only some of them, together with timestamps. The output I want is shown below; the idea is to merge the timestamp column into the main dataset. Note that I am not allowed to use the pandas library.
tx_match.txt
col1 col2 col3 col4
171 9 9 5000000000
183 171 9 4000000000
185 183 9 3000000000
187 185 9 2900000000
192 187 187 100000000
227 185 185 100000000
255 187 9 2800000000
504 367 367 5000000000
504 192 192 100000000
504 255 255 1000000000
533 293 293 5000000000
555 533 533 2500000000
txid_time.txt
col1 col2
227 2017-02-10
255 2017-01-10
504 2017-02-09
The output I want is:
227 185 185 100000000 2017-02-10
255 187 9 2800000000 2017-01-10
504 367 367 5000000000 2017-02-09
504 192 192 100000000 2017-02-09
504 255 255 1000000000 2017-02-09
So far, I have done this:
import csv

# Build a lookup table from the match file: col1 -> col2.
d = {}
fin = open("tx_match.txt", "r")
for line in fin:
    try:
        line = line.rstrip()
        f = line.split("\t")
        k = f[0]
        v = f[1]
        d[k] = v
    except IndexError:
        continue
fin.close()
# print(d)

# Walk the time file and write out rows whose txid is in the lookup table.
fin = open("txid_time.txt", "r")
fout = open("txmatch_time.txt", "w")
foutWriter = csv.writer(fout)
for line in fin:
    try:
        line = line.rstrip()
        f = line.split("\t")
        txid = f[0]
        prvtxid = d[txid]
        foutWriter.writerow([f[0] + "\t" + f[1] + "\t" + prvtxid])
    except IndexError:
        continue
    except KeyError:
        continue
fin.close()
fout.close()
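For reference, the attempt above can be completed as a plain dict join: load the smaller txid_time.txt into a dict keyed by txid, then stream tx_match.txt once and append the timestamp to each matching row. This is only a sketch; merge_time and the output path are illustrative names I chose, and it assumes tab-separated files with a header row, as in the samples above.

```python
import csv

def merge_time(match_path, time_path, out_path):
    # Load the smaller file into a dict: txid -> timestamp.
    times = {}
    with open(time_path, newline="") as ftime:
        reader = csv.reader(ftime, delimiter="\t")
        next(reader, None)  # skip the "col1 col2" header row
        for row in reader:
            if len(row) >= 2:
                times[row[0]] = row[1]
    # Stream the big file once; keep only rows whose key has a timestamp.
    with open(match_path, newline="") as fmatch, \
         open(out_path, "w", newline="") as fout:
        reader = csv.reader(fmatch, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        next(reader, None)  # skip the "col1 col2 col3 col4" header row
        for row in reader:
            if row and row[0] in times:
                writer.writerow(row + [times[row[0]]])
```

Memory use is proportional to txid_time.txt only, which per the question is the smaller of the two files.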
Thanks in advance for your support.
Answer 0 (score: 0)
Your solution will work. However, it needs space linear in the size of the lookup table. The solution below improves on that: given inputs sorted by key, it achieves best-case constant space. It also makes better use of automatic context management (the with statement) and of the csv package's Reader and Writer for parsing and joining fields. (Note that for clarity I have left out the IndexError and KeyError handling; you may want to add it yourself if needed.)
import csv

col_delim = '\t'
row_delim = '\n'

with open('txid_time.txt', 'r') as ftime, \
     open('tx_match.txt', 'r') as fmatch, \
     open('txmatch_time.txt', 'w') as fmerge:
    rtime = csv.reader(ftime, delimiter=col_delim, lineterminator=row_delim)
    rmatch = csv.reader(fmatch, delimiter=col_delim, lineterminator=row_delim)
    wmerge = csv.writer(fmerge, delimiter=col_delim, lineterminator=row_delim)
    try:
        next(rtime)   # skip the header row
        next(rmatch)  # skip the header row
        time = next(rtime)
        match = next(rmatch)
        continue_ = True
        while continue_:
            # Both files must be sorted by their numeric key (column 0);
            # compare as integers, since string comparison is lexicographic.
            while int(time[0]) < int(match[0]):
                time = next(rtime)
            while int(time[0]) > int(match[0]):
                match = next(rmatch)
            if time[0] == match[0]:
                key = time[0]
                # Gather all time rows and all match rows sharing this key.
                times = []
                try:
                    while time[0] == key:
                        times.append(time)
                        time = next(rtime)
                except StopIteration:
                    continue_ = False
                matches = []
                try:
                    while match[0] == key:
                        matches.append(match)
                        match = next(rmatch)
                except StopIteration:
                    continue_ = False
                # Emit the cross product of the two groups.
                for match in matches:
                    for time in times:
                        merge = match + time[1:2]
                        wmerge.writerow(merge)
    except StopIteration:
        pass
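The same merge-join idea can also be written as a self-contained generator over already-parsed rows, which makes it easy to test without touching the filesystem. This is only a sketch: merge_join is an illustrative name, both inputs must be sorted by their integer key in column 0, and (as in the sample data) each key is assumed to appear at most once on the time side.

```python
def merge_join(match_rows, time_rows):
    # match_rows: rows sorted by int(row[0]); repeated keys are allowed.
    # time_rows:  rows sorted by int(row[0]); keys assumed unique.
    match_it, time_it = iter(match_rows), iter(time_rows)
    try:
        match, time = next(match_it), next(time_it)
        while True:
            mk, tk = int(match[0]), int(time[0])
            if mk < tk:
                match = next(match_it)
            elif mk > tk:
                time = next(time_it)
            else:
                # Collect every match row that shares the current key.
                group = []
                while match is not None and int(match[0]) == mk:
                    group.append(match)
                    try:
                        match = next(match_it)
                    except StopIteration:
                        match = None
                # Emit each collected row with the timestamp appended.
                for row in group:
                    yield row + [time[1]]
                if match is None:
                    return
                time = next(time_it)
    except StopIteration:
        return
```

Because it keeps only one key's group of rows in memory at a time, space use stays constant apart from that group.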