Question

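I have two files.

The first file (about 4 million entries) has 2 columns: [label] [energy]
The second file (~200,000 entries) has 2 columns: [upper label] [lower label]

For example:

File 1: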

375677 4444.5              
375678 6890.4        
375679  786.0

File 2:

375677 375679      
375678 375679

I want to replace the 'label' values in file 2 with the 'energy' values from file 1, so that file 2 becomes:

File 2 (new):

4444.5 786.0   
6890.4 786.0

Or, alternatively, append the 'energy' values to file 2, so that file 2 becomes:

File 2 (alternative):

375677 375679 4444.5 786.0  
375678 375679 6890.4 786.0

There must be a way to do this in Python, but my brain isn't working.

So far, I have written:

from sys import argv
from scanfile import scanner

class UnknownCommand(Exception): pass

def processLine(line):
  # Print every line whose label starts with the hard-coded prefix, minus the newline.
  if line.startswith('23'):
    print(line[0:-1])

filename = 'test1.txt'
if len(argv) == 2: filename = argv[1]
scanner(filename, processLine)

where scanfile is:

def scanner(name, function):   
  file = open(name, 'r')   
  while True:   
    line = file.readline()   
    if not line: break   
    function(line)   
  file.close()

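This lets me search file 1 and print a label plus its value by manually typing in a label from file 2 (e.g. 23). Pointless and time-consuming.

I need to write a section that reads the labels from file 2 and feeds them one after another into line.startswith('label'), until the end of file 2.

Any suggestions?

Thanks for your help.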

Answer 1

Assuming the labels in file1 are unique, I would first read that file into a dictionary:

with open('file1') as fd:
    data1 = dict(line.strip().split()
                 for line in fd if line.strip())

This gives you a dictionary data1 that looks like this:

{
  '375677': '4444.5',
  '375678': '6890.4',
  '375679': '786.0',
}
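
The uniqueness assumption matters because dict() silently keeps the last value when a label repeats. Below is a minimal sketch of a loader that checks it, assuming whitespace-separated columns; the name load_energies is hypothetical and not part of the answer above:

```python
def load_energies(lines):
    """Build {label: energy} from 'label energy' lines, rejecting duplicates."""
    energies = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        label, energy = fields  # ValueError if the line does not have exactly 2 columns
        if label in energies:
            raise ValueError("duplicate label %s on line %d" % (label, lineno))
        energies[label] = energy  # kept as a string so the output formatting is unchanged
    return energies


# Sample lines taken from the question's File 1.
sample = ["375677 4444.5\n", "375678 6890.4\n", "375679  786.0\n"]
data1 = load_energies(sample)
```

With a real file you would pass the open file object instead of a list, for example load_energies(fd) inside a with open('file1') as fd: block.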

Now go through file2 and make the substitution as you iterate over the file:

with open('file2') as fd:
    for line in fd:
        data = line.strip().split()
        print(data1[data[0]], data1[data[1]])

Or, for the alternative:

with open('file2') as fd:
    for line in fd:
        data = line.strip().split()
        print(' '.join(data), data1[data[0]], data1[data[1]])
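
Both variants can be folded into one helper. This is only a sketch under the same assumptions (every label in file2 exists in file1, columns are whitespace-separated); the function name substitute and its keep_labels parameter are hypothetical:

```python
def substitute(pairs, energies, keep_labels=False):
    """Yield one output line per 'upper lower' pair found in pairs."""
    for line in pairs:
        labels = line.split()
        if not labels:
            continue  # skip blank lines
        # Raises KeyError if a label from file 2 is missing in file 1.
        values = [energies[label] for label in labels]
        yield " ".join((labels + values) if keep_labels else values)


# Sample data taken from the question.
energies = {"375677": "4444.5", "375678": "6890.4", "375679": "786.0"}
file2_lines = ["375677 375679\n", "375678 375679\n"]

replaced = list(substitute(file2_lines, energies))
appended = list(substitute(file2_lines, energies, keep_labels=True))
```

Because the helper yields lines lazily, it can be pointed at an open file and written straight to an output file, so the 200,000 pairs never have to be held in memory at once.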

Answer 2

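This approach is only worth taking if the 4M entries are too much for your memory:

1. Build a set from all the File2 IDs (upper and lower).
2. Loop over the large file (File1) and build a dictionary containing only the entries whose label is in the set.
3. Loop over File2 again and build the output file.

Some code to demonstrate it: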

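# Collect every label that appears in File2 (upper and lower).
s = set()
with open('File2') as file2:
    for line in file2:
        for i in line.split():
            s.add(i)
# Keep only the File1 entries whose label is in that set.
d = {}
with open('File1') as file1:
    for line in file1:
        k, v = line.split()
        if k in s:
            d[k] = v
# Write each File2 pair followed by its two energies, one pair per line.
with open('NewFile2', 'w') as out_file:
    with open('File2') as file2:
        for line in file2:
            k1, k2 = line.split()
            out_file.write(' '.join([k1, k2, d[k1], d[k2]]) + '\n')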
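
The same three passes can be wrapped in a function that tolerates blank lines and labels missing from File1. This is a sketch, not the answer's own code: the name join_energies is hypothetical, and skipping (and counting) pairs with a missing label is an assumption about the desired behaviour, where the code above would raise KeyError instead.

```python
import os
import tempfile


def join_energies(file1_path, file2_path, out_path):
    """Append energies to each File2 pair; return the number of skipped pairs."""
    # Pass 1 over the small file: collect every label that is referenced.
    wanted = set()
    with open(file2_path) as file2:
        for line in file2:
            wanted.update(line.split())

    # Single pass over the large file: keep only the referenced entries.
    energies = {}
    with open(file1_path) as file1:
        for line in file1:
            fields = line.split()
            if len(fields) == 2 and fields[0] in wanted:
                energies[fields[0]] = fields[1]

    # Pass 2 over the small file: write the labels followed by their energies.
    skipped = 0
    with open(file2_path) as file2, open(out_path, "w") as out_file:
        for line in file2:
            labels = line.split()
            if not labels:
                continue  # ignore blank lines
            if all(label in energies for label in labels):
                values = [energies[label] for label in labels]
                out_file.write(" ".join(labels + values) + "\n")
            else:
                skipped += 1  # at least one label has no entry in File1
    return skipped


# Small demonstration on temporary files built from the question's sample data,
# plus one unused File1 entry and one File2 pair with an unknown label.
with tempfile.TemporaryDirectory() as tmp:
    f1 = os.path.join(tmp, "File1")
    f2 = os.path.join(tmp, "File2")
    out = os.path.join(tmp, "NewFile2")
    with open(f1, "w") as fh:
        fh.write("375677 4444.5\n375678 6890.4\n375679  786.0\n999999 1.0\n")
    with open(f2, "w") as fh:
        fh.write("375677 375679\n375678 375679\n375678 123456\n")
    skipped = join_energies(f1, f2, out)
    with open(out) as fh:
        result = fh.read()
```

Peak memory is bounded by the number of distinct labels in File2 (at most about 400,000 here) rather than by the 4 million rows of File1.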

Python: search for values in one text file, compare them with values in another text file, and replace the values on a match
