Question

I have two text files (A and B), like this:

A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...

B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...

What I want to do is read these two files and produce a new text file like this:

1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...

Using for loops, I created this function (in Python):

def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            out.write(line)  # write to the text file; 'out' stands for the already-open output file
        else:
            break

Then I call the function like this:

for i in range(1,3):        
    find(o, i)
    find(r, i)

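I lose some data, because the first line containing a different number gets read but never ends up in the final .txt file. In this example, the lines 2 stringhere 4 and 2 stringhere 1 are lost.

Is there any way to avoid this?

Thanks in advance.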

Answer 1

If the files fit in memory:

with open('A') as file1, open('B') as file2:
    L = file1.read().splitlines()
    L.extend(file2.read().splitlines())
L.sort(key=lambda line: int(line.partition(' ')[0])) # sort by 1st column
print("\n".join(L)) # print result

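This is an efficient approach if the total number of lines is under a million. Otherwise, and especially if you have many sorted files, you can use heapq.merge() to combine them.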
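For the heapq.merge() route, here is a minimal sketch. It assumes every input is already sorted by its first column and that you are on Python 3.5 or later (where heapq.merge() accepts a key argument). The names first_column and merge_sorted_files are made up for illustration, and io.StringIO objects stand in for the real files so the snippet runs on its own.

```python
import heapq
import io


def first_column(line):
    # Sort key: the integer in the first whitespace-separated column.
    return int(line.split(None, 1)[0])


def merge_sorted_files(files, out):
    # heapq.merge() pulls lines lazily, so the inputs are never held
    # in memory all at once. Each input must already be sorted by key.
    for line in heapq.merge(*files, key=first_column):
        # Make sure every written line ends with a newline.
        out.write(line if line.endswith("\n") else line + "\n")


# In-memory stand-ins for files A and B (hypothetical sample data).
a = io.StringIO("1 stringhere 5\n1 stringhere 3\n2 stringhere 4\n2 stringhere 4\n")
b = io.StringIO("1 stringhere 4\n1 stringhere 5\n2 stringhere 1\n2 stringhere 2\n")
out = io.StringIO()

merge_sorted_files([a, b], out)
print(out.getvalue(), end="")
```

With real files you would pass the open file objects instead, e.g. merge_sorted_files([open_a, open_b], open_out) inside a with block. For lines with equal keys, the ones from the earlier input come out first, which is the ordering asked for in the question.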

Answer 2

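In your loop, by the time you find a line that doesn't start with the same value as i, you have already consumed that line, so when the function is called again with i+1, it starts at the second matching line.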
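To see the effect in isolation, here is a small sketch. The helper take_while_prefix and the sample data are made up for illustration, and io.StringIO stands in for a real file. The line that triggers the break has already been pulled from the iterator, so the next call never sees it.

```python
import io


def take_while_prefix(f, i):
    # Collect lines whose first column equals i, stopping at the first mismatch.
    taken = []
    for line in f:
        if int(line.split()[0]) == i:
            taken.append(line)
        else:
            break  # this mismatching line has already been consumed
    return taken


# In-memory stand-in for one of the input files (hypothetical data).
f = io.StringIO("1 a\n1 b\n2 c\n2 d\n")

first = take_while_prefix(f, 1)
second = take_while_prefix(f, 2)

print(first)   # ['1 a\n', '1 b\n']
print(second)  # ['2 d\n']  ('2 c' was swallowed by the first call)
```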

Either read all the files into memory beforehand (see @J.F.Sebastian's answer), or, if that's not an option, replace your function with the following:

def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]): # Need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1) # Need to 'unread' the last read line
            break

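This version "rewinds" the cursor so that the next call to readline reads the right line again. Note that mixing the implicit for line in l with seek calls is discouraged, hence the while True.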
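The function above is Python 2. On Python 3, files opened in text mode reject a nonzero relative seek such as seek(-len(line), 1), so a rough equivalent is to remember the position with tell() before each readline() and seek back to it. This is only a sketch under that assumption: the extra out parameter is my addition, and io.StringIO objects stand in for the real files.

```python
import io


def find(arch, i, out):
    # Copy lines whose first column equals i, then "unread" the first mismatch.
    while True:
        pos = arch.tell()          # position before reading the next line
        line = arch.readline()
        if line and int(line.split()[0]) == i:
            out.write(line)
        else:
            arch.seek(pos)         # rewind so the next call re-reads this line
            break


# In-memory stand-ins for the two input files (same data as t1 and t2 in the example).
o = io.StringIO("1 stringhere 1\n1 stringhere 2\n1 stringhere 3\n"
                "2 stringhere 1\n2 stringhere 2\n")
r = io.StringIO("1 stringhere 4\n1 stringhere 5\n"
                "2 stringhere 1\n2 stringhere 2\n")
out = io.StringIO()

for i in range(1, 3):
    find(o, i, out)
    find(r, i, out)

print(out.getvalue(), end="")
```

As with the original, readline() is used rather than for line in arch, because on Python 3 calling tell() while iterating over a text file raises an error.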

For example:

$ cat t.py
o = open("t1")
r = open("t2")
print o
print r


def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break

for i in range(1, 3):
    find(o, i)
    find(r, i)

$ cat t1 
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$

Answer 3

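There may be a less complicated way to accomplish this. The following also keeps the lines in the order in which they appear in the files, as you seem to want. This works because Python's sort is stable: lines with the same key keep their relative order.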

lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))

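The first three lines read the input files into a single list of lines.

The fourth line ensures that each line ends with exactly one newline. If you are sure both files end with a newline, you can omit this line.

The fifth line defines the sort key as the integer value of the first word of the string.

The sixth line sorts the lines and writes the result to the output file.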
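If you would rather have the files closed explicitly, the same steps can be sketched with with blocks. The function name merge_by_first_column is made up, the demo writes its sample files into a temporary directory, and it assumes no line is blank (a blank line would make the key function fail).

```python
import os
import tempfile


def merge_by_first_column(path_a, path_b, out_path):
    lines = []
    for path in (path_a, path_b):
        with open(path) as f:
            # Normalize so every line ends with exactly one newline.
            lines.extend(line.rstrip("\n") + "\n" for line in f)
    # list.sort() is stable, so lines with equal keys keep their input order.
    lines.sort(key=lambda line: int(line.split()[0]))
    with open(out_path, "w") as out:
        out.writelines(lines)


# Demo with throwaway files (hypothetical sample data).
tmp = tempfile.mkdtemp()
path_a = os.path.join(tmp, "file_a.txt")
path_b = os.path.join(tmp, "file_b.txt")
out_path = os.path.join(tmp, "out_file.txt")

with open(path_a, "w") as f:
    f.write("1 stringhere 5\n1 stringhere 3\n2 stringhere 4")  # no final newline
with open(path_b, "w") as f:
    f.write("1 stringhere 4\n2 stringhere 1\n")

merge_by_first_column(path_a, path_b, out_path)

with open(out_path) as f:
    result = f.read()
print(result, end="")
```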

Losing lines from two text files when iterating
