Question

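I am working with two files that contain millions of records. I am only sharing test data here to explain the problem I am facing. For example, tx_match.txt contains all the records, and txid_time.txt contains only some records, each with a timestamp. The output I want is shown below; the idea is to merge the additional column into the matching records of the main database. Please note that I am not allowed to use the pandas library.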

tx_match.txt

col1  col2  col3      col4
171    9    9    5000000000
183    171    9    4000000000
185    183    9    3000000000
187    185    9    2900000000
192    187    187  100000000
227    185    185  100000000
255    187    9    2800000000
504    367    367  5000000000
504    192    192  100000000
504    255    255  1000000000
533    293    293  5000000000
555    533    533  2500000000

txid_time.txt

col1      col2
227     2017-02-10
255     2017-01-10
504     2017-02-09

The output I want is:

227    185     185     100000000   2017-02-10
255    187     9       2800000000  2017-01-10 
504    367     367     5000000000  2017-02-09
504    192     192     100000000   2017-02-09
504    255     255     1000000000  2017-02-09

So far, I have done this:

import csv 
d={}
fin = open("tx_match.txt","r")
for line in fin:
    try:
        line = line.rstrip()
        f = line.split("\t")
        k=f[0]
        v=f[1]
        d[k]=v
    except IndexError:
        continue

fin.close()
#print(d)
fin = open("txid_time.txt","r")
fout = open("txmatch_time.txt",'w')
foutWriter=csv.writer(fout)
for line in fin:
    try:
         line = line.rstrip()
         f = line.split("\t")
         txid=f[0]
         prvtxid=d[txid]    
         foutWriter.writerow([f[0]+"\t"+f[1]+"\t"+prvtxid])
    except IndexError:
         continue
    except KeyError:
         continue
fin.close()    
fout.close()

Thanks in advance for your support.

Answer 1

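Your solution will work. However, it requires linear space complexity at best. The following solution improves on it to achieve constant space complexity in the best case. It does so with a sort-merge, so it assumes both files are already sorted by their first column. It also makes better use of context managers (the with statement) and of the automatic parsing and joining done by the csv package's reader and writer. (Note that for clarity I have left out the IndexError and KeyError handling; you may want to add it yourself if needed.)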

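import csv

col_delim = '\t'
row_delim = '\n'

# Sort-merge join: both files must already be sorted by their first column.
# newline='' lets the csv module handle line endings itself.
with open('txid_time.txt', 'r', newline='') as ftime, \
     open('tx_match.txt', 'r', newline='') as fmatch, \
     open('txmatch_time.txt', 'w', newline='') as fmerge:
    rtime = csv.reader(ftime, delimiter=col_delim)
    rmatch = csv.reader(fmatch, delimiter=col_delim)
    wmerge = csv.writer(fmerge, delimiter=col_delim, lineterminator=row_delim)

    try:
        time = next(rtime)
        match = next(rmatch)
        continue_ = True
        while continue_:
            # Advance whichever file is behind until the keys line up.
            # Keys are compared as strings, so the files must be sorted in
            # string order; for numerically sorted IDs of varying length,
            # compare int(time[0]) and int(match[0]) instead.
            while time[0] < match[0]:
                time = next(rtime)
            while time[0] > match[0]:
                match = next(rmatch)
            if time[0] == match[0]:
                key = time[0]
                # Collect every row with this key from each file; `time` and
                # `match` end up holding the first row of the next group.
                times = []
                try:
                    while time[0] == key:
                        times.append(time)
                        time = next(rtime)
                except StopIteration:
                    continue_ = False
                matches = []
                try:
                    while match[0] == key:
                        matches.append(match)
                        match = next(rmatch)
                except StopIteration:
                    continue_ = False
                # Separate loop names keep the look-ahead rows in
                # `time` and `match` intact for the next iteration.
                for m in matches:
                    for t in times:
                        # Append the timestamp column to the full record.
                        wmerge.writerow(m + t[1:2])
    except StopIteration:
        # One of the files ran out while skipping ahead: nothing left to merge.
        pass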
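
If the files are not sorted by the first column, the merge above does not apply. A minimal sketch of an alternative, under some assumptions: txid_time.txt is small enough to hold in memory, both files are tab-delimited, and each ID appears at most once in txid_time.txt (a later duplicate would overwrite an earlier one). The function name `merge_by_lookup` and its parameters are made up for illustration. It loads only the small file into a dict and streams the large file row by row, so memory grows with the number of timestamped IDs rather than with the main database.

```python
import csv


def merge_by_lookup(time_path, match_path, out_path, delimiter="\t"):
    """Append the timestamp from time_path to each matching row of match_path.

    Returns the number of rows written. Input order does not matter.
    """
    # Load the small file into memory: first column -> timestamp.
    with open(time_path, newline="") as ftime:
        times = {
            row[0]: row[1]
            for row in csv.reader(ftime, delimiter=delimiter)
            if len(row) >= 2  # skip blank or malformed lines
        }

    written = 0
    # Stream the large file one row at a time; it is never held in memory.
    with open(match_path, newline="") as fmatch, \
         open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout, delimiter=delimiter, lineterminator="\n")
        for row in csv.reader(fmatch, delimiter=delimiter):
            if row and row[0] in times:
                writer.writerow(row + [times[row[0]]])
                written += 1
    return written
```

It could be called as `merge_by_lookup('txid_time.txt', 'tx_match.txt', 'txmatch_time.txt')`. Output rows keep the order of tx_match.txt. If both files start with a header row whose first field is the same (for example col1), that row is treated like any other record and merged too.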
