Question

I currently have two data files, shown below:

File 1:

test1 ba ab cd dh gf
test2 fa ab cd dh gf
test3 rt ty er wq ee
test4 er rt sf sd sa

and file 2:

test1 123 344 123
test1 234 567 787
test1 221 344 566
test3 456 121 677

I want to merge the files based on matching rows in the first column (so that the "test" names match).

Like this:

test1 ba ab cd dh gf 123 344 123
test1 ba ab cd dh gf 234 567 787
test1 ba ab cd dh gf 221 344 566
test3 rt ty er wq ee 456 121 677

I have this code:

def combineFiles(file1,file2,outfile):
      def read_file(file):
         data = {}
         for line in csv.reader(file):
            data[line[0]] = line[1:]
         return data
      with open(file1, 'r') as f1, open(file2, 'r') as f2:
         data1 = read_file(f1)
         data2 = read_file(f2)
         with open(outfile, 'w') as out:
            wtr= csv.writer(out)
            for key in data1.keys():
               try:
                  wtr.writerow(((key), ','.join(data1[key]), ','.join(data2[key])))
               except KeyError:
                  pass

But the output ends up looking like this:

test1 ba ab cd dh gf 123 344 123
test3 er rt sf sd sa 456 121 677

Can anyone help me produce the output so that test1 is printed all three times?

Many thanks

Answer 1

You may want to try the Pandas library; it makes this kind of case simple:

>>> import pandas as pd
>>> pd.merge(df1, df2, on='testnum', how='inner')
  testnum 1_x 2_x 3_x   4   5  1_y  2_y  3_y
0   test1  ba  ab  cd  dh  gf  123  344  123
1   test1  ba  ab  cd  dh  gf  234  567  787
2   test1  ba  ab  cd  dh  gf  221  344  566
3   test3  rt  ty  er  wq  ee  456  121  677

This assumes the test column is named "testnum".

>>> df1
  testnum   1   2   3   4   5
0   test1  ba  ab  cd  dh  gf
1   test2  fa  ab  cd  dh  gf
2   test3  rt  ty  er  wq  ee
3   test4  er  rt  sf  sd  sa

>>> df2
  testnum    1    2    3
0   test1  123  344  123
1   test1  234  567  787
2   test1  221  344  566
3   test3  456  121  677

You would read these in with pd.read_csv().
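For completeness, here is a minimal sketch of how the two frames above might be built and merged. It assumes the files are whitespace-separated with no header row. The column names passed to `names` are chosen only to match the frames shown above, and `io.StringIO` stands in for the real file paths.

```python
import io
import pandas as pd

# Stand-ins for the two files; with real files, pass the path instead.
file1 = io.StringIO(
    "test1 ba ab cd dh gf\n"
    "test2 fa ab cd dh gf\n"
    "test3 rt ty er wq ee\n"
    "test4 er rt sf sd sa\n"
)
file2 = io.StringIO(
    "test1 123 344 123\n"
    "test1 234 567 787\n"
    "test1 221 344 566\n"
    "test3 456 121 677\n"
)

# header=None because the files carry no header row; names supplies the labels.
df1 = pd.read_csv(file1, sep=r"\s+", header=None,
                  names=["testnum", "1", "2", "3", "4", "5"])
df2 = pd.read_csv(file2, sep=r"\s+", header=None,
                  names=["testnum", "1", "2", "3"])

merged = pd.merge(df1, df2, on="testnum", how="inner")

# Render space-separated rows without the header or the index.
# Passing a file path as the first argument would write to disk instead.
output = merged.to_csv(sep=" ", header=False, index=False)
print(output)
```

With `how="inner"`, only test names present in both files are kept, so test2 and test4 drop out.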

Answer 2

Although I would recommend Brad Solomon's approach because it is concise, you only need a small change to your code.

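Since the second file is the one that has the "final say", you only need to build a dictionary for the first file. You can then write to the output file as you read the second file, pulling values from the data1 dictionary as you go: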

with open(file1, 'r') as f1, open(file2, 'r') as f2:
    # read_file must also split on spaces, i.e. use csv.reader(file, delimiter=' ')
    data1 = read_file(f1)
    with open(outfile, 'w') as out:
        wtr = csv.writer(out, delimiter=' ')
        for line in csv.reader(f2, delimiter=' '):
            # only write if there is a corresponding line in file1
            if line[0] in data1:
                # key first, then the file1 columns, then the file2 columns
                wtr.writerow(line[0:1] + data1[line[0]] + line[1:])
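Putting the pieces together, a self-contained version might look like the sketch below. The name `combine_files` is chosen here for illustration. It assumes both files are separated by single spaces and that the first column of file 1 is unique.

```python
import csv

def combine_files(file1, file2, outfile):
    """Append file1's columns to every file2 row with a matching first column."""
    with open(file1, newline="") as f1:
        # One entry per key is enough because file1's keys are assumed unique.
        data1 = {row[0]: row[1:] for row in csv.reader(f1, delimiter=" ") if row}

    with open(file2, newline="") as f2, open(outfile, "w", newline="") as out:
        wtr = csv.writer(out, delimiter=" ")
        for row in csv.reader(f2, delimiter=" "):
            # Skip blank lines and rows whose key is missing from file1.
            if row and row[0] in data1:
                # key, then the file1 columns, then the file2 columns
                wtr.writerow(row[0:1] + data1[row[0]] + row[1:])
```

Because the output is driven by file 2, each of its rows appears once, in its original order.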

Answer 3

The problem is that you are overwriting the key in the line

data[line[0]] = line[1:]
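A small illustration of the overwrite, using the rows of file 2 as already-split lists (the splitting step is left out here for brevity):

```python
rows = [
    ["test1", "123", "344", "123"],
    ["test1", "234", "567", "787"],
    ["test1", "221", "344", "566"],
    ["test3", "456", "121", "677"],
]

data = {}
for line in rows:
    # Each assignment replaces whatever was stored under the same key.
    data[line[0]] = line[1:]

print(len(data))      # 2: four rows collapse into two keys
print(data["test1"])  # ['221', '344', '566']: only one test1 row remains
```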

Since the files have non-unique "keys", you could try making them unique manually with enumerate:

for ind, line in enumerate(csv.reader(file)):
    unique_key = ''.join([line[0], "_", str(ind)])
    data[unique_key] = line[1:]

Later, when you merge the results, you can strip the key to drop everything after the underscore:

wtr.writerow((key.split("_")[0], ','.join(data1[key]), ','.join(data2[key])))
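One caveat: the row index becomes part of each key, so a key such as test3_2 built from file 1 will generally not exist in file 2's dictionary, where the same test sits at a different row. A sketch that sidesteps this, assuming the first column of file 1 is unique and both files are space-separated, keeps plain keys for file 1 and suffixes only file 2's keys. The helper names `read_unique` and `read_indexed` are made up for this example, and `io.StringIO` stands in for the real files.

```python
import csv
import io

def read_unique(file):
    # Plain keys: assumes the first column of this file has no duplicates.
    return {line[0]: line[1:] for line in csv.reader(file, delimiter=" ")}

def read_indexed(file):
    # Suffix each key with its row index so duplicate names are all kept.
    data = {}
    for ind, line in enumerate(csv.reader(file, delimiter=" ")):
        data["_".join([line[0], str(ind)])] = line[1:]
    return data

f1 = io.StringIO("test1 ba ab cd dh gf\ntest3 rt ty er wq ee\n")
f2 = io.StringIO("test1 123 344 123\ntest1 234 567 787\ntest3 456 121 677\n")

data1 = read_unique(f1)
data2 = read_indexed(f2)

merged = []
for key, values in data2.items():
    name = key.rsplit("_", 1)[0]  # strip the "_<index>" suffix
    if name in data1:
        merged.append([name] + data1[name] + values)

for row in merged:
    print(" ".join(row))
```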

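To my taste, this is all rather clunky. If your goal is to read, manipulate, and write data in csv files, I suggest looking into pandas, since this can be written in a few lines of code using DataFrames (see Brad Solomon's answer).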

Answer 4

You could try collecting the items into separate collections.defaultdict() objects, then use itertools.product() to get the Cartesian product of the intersecting rows:

from collections import defaultdict
from itertools import product

def collect_rows(file):
    collection = defaultdict(list)

    for line in file:
        col1, *rest = line.split()
        collection[col1].append(rest)

    return collection

with open("file1.txt") as f1, open("file2.txt") as f2, open("output.txt", "w") as out:
    f1_collection = collect_rows(f1)
    f2_collection = collect_rows(f2)

    # Ordered intersection, no need to sort
    set_2 = set(f2_collection)
    intersection = [key for key in f1_collection if key in set_2]

    for key in intersection:
        for x, y in product(f1_collection[key], f2_collection[key]):
            out.write("%s\n" % " ".join([key] + x + y))

Which gives the following output.txt:

test1 ba ab cd dh gf 123 344 123
test1 ba ab cd dh gf 234 567 787
test1 ba ab cd dh gf 221 344 566
test3 rt ty er wq ee 456 121 677
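To see what the Cartesian product adds, consider a case that does not occur in the question's data: a key that repeats in both files. The rows below are hypothetical, and product() pairs every file 1 row with every file 2 row for that key.

```python
from itertools import product

# Hypothetical rows sharing the key "test1" in both files.
left = [["ba", "ab"], ["fa", "ab"]]
right = [["123", "344"], ["234", "567"]]

rows = [" ".join(["test1"] + x + y) for x, y in product(left, right)]

for row in rows:
    print(row)
# test1 ba ab 123 344
# test1 ba ab 234 567
# test1 fa ab 123 344
# test1 fa ab 234 567
```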

Note: Brad Solomon's Pandas approach is probably easier, since it can be done in one command.

Merge data based on a matching first column in Python
