Question

I have two files of different lengths, but both have 3 tab-separated columns. File2 has roughly 5,000,000,000 lines and File1 has 2,000,000:

File1:

abc foo bar
lmn potato rst
lmp tomato  asd

File2:

123 asdasc  dad
032 foo 2134
123 linkin  9123
42  cads    asd
45654   tomato  12123

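I need to use the second column as the key for matching the two files. Whenever the second column matches, I need to extract the matching lines from both File1 and File2.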

fout = open('outfile', 'w')
with open('file1', 'r') as f1, open('file2', 'r') as f2:
  file2_keys = [i.split('\t')[1] for i in f2]
  for line in f1:
    if line.split('\t')[1] in file2_keys:
      print>>fout, line

But this only gives me the lines from File1.

The desired output should be:

Outfile2:

032 foo 2134
45654   tomato  12123

Outfile1:

abc foo bar
lmp tomato  asd

Is there a way to do this efficiently in Unix? More generally, how can this be done efficiently?

Answer 1

Is there a way to do this efficiently in Unix?

You can use awk.

awk 'NR==FNR{a[$2]=$2;next}{if ($2 in a) {print $0}}' File1 File2

will produce the desired output from File2 (the lines of File2 whose second column also appears in File1):

032 foo 2134
45654   tomato 12123

Similarly,

awk 'NR==FNR{a[$2]=$2;next}{if ($2 in a) {print $0}}' File2 File1

will produce the desired output from File1 (the lines of File1 whose second column also appears in File2):

abc foo bar
lmp tomato  asd
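
For readers less familiar with awk, the two commands above follow one pattern: while reading the first file (the NR==FNR branch), remember every value of column 2; then print each line of the second file whose column 2 was remembered. A rough Python sketch of that pattern is below. The names filter_by_key and key_of are hypothetical, and the sketch assumes tab-separated input.

```python
def key_of(line, col=1, sep="\t"):
    """Return the field at index `col`, or None if the line is too short."""
    fields = line.rstrip("\n").split(sep)
    return fields[col] if len(fields) > col else None


def filter_by_key(key_path, data_path, out_path, col=1, sep="\t"):
    """Write the lines of data_path whose key column also occurs in key_path."""
    # First pass (awk's NR==FNR branch): collect the keys of the first file.
    with open(key_path) as key_file:
        keys = {key_of(line, col, sep) for line in key_file}
    keys.discard(None)  # ignore malformed lines that have no key column

    # Second pass: stream the other file and keep lines whose key was seen.
    with open(data_path) as data_file, open(out_path, "w") as out:
        for line in data_file:
            if key_of(line, col, sep) in keys:
                out.write(line)
```

Calling filter_by_key("File1", "File2", "outfile2") mirrors the first awk command, and swapping the first two arguments mirrors the second. As with awk, whichever file comes first has its distinct keys held in memory, so passing File2 first may need a lot of memory if its keys are mostly unique.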

Answer 2

Efficient? Use awk as 0xdeadbeef suggested, or stick with C++:

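#include <fstream>
#include <iostream>
#include <set>
#include <string>

int main() {

    std::string a, b, c;
    std::set<std::string> s;

    // Collect the keys (second column) of the smaller file.
    std::ifstream file1("File1");
    while (file1 >> a >> b >> c)
        s.insert(b);

    // Stream the larger file and print the lines whose key occurs in File1.
    // operator>> splits on any whitespace, so fields must not contain spaces.
    std::ifstream file2("File2");
    while (file2 >> a >> b >> c)
        if (s.count(b) != 0)
            std::cout << a << '\t' << b << '\t' << c << '\n';  // '\n' avoids a flush per line

}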

Output:

032 foo 2134
45654   tomato  12123

Answer 3

Since you added a python tag and tried it in Python, here is a Python implementation:

with open('file1') as f1, open('file2') as f2, \
        open('outfile1', 'w') as fout1, open('outfile2', 'w') as fout2:
    # Map each file2 key (second column) to its line.
    # If a key repeats in file2, only its last line is kept.
    d = {}
    for line in f2:
        line = line.rstrip('\n')
        d[line.split('\t')[1]] = line
    # The columns are tab-separated, so split on '\t' rather than ' '.
    for line in f1:
        line = line.rstrip('\n')
        k = line.split('\t')[1]
        if k in d:
            print(line, file=fout1)
            print(d[k], file=fout2)
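
The implementation above reads all of file2 into memory, which may not be feasible at the size mentioned in the question. One possible variant, sketched below, keeps only file1's keys in memory, streams file2 once, and then re-reads file1 to write its matching lines. The function name split_matches and its parameters are hypothetical, and tab-separated input is assumed.

```python
def split_matches(file1_path, file2_path, out1_path, out2_path, col=1, sep="\t"):
    """Write matching lines of file1 to out1_path and of file2 to out2_path."""

    def key_of(line):
        # Field at index `col`, or None for lines with too few columns.
        fields = line.rstrip("\n").split(sep)
        return fields[col] if len(fields) > col else None

    # Keys of the smaller file; a set gives constant-time membership tests.
    with open(file1_path) as f1:
        file1_keys = {key_of(line) for line in f1}
    file1_keys.discard(None)

    # Stream the large file once, remembering only which keys matched.
    matched = set()
    with open(file2_path) as f2, open(out2_path, "w") as out2:
        for line in f2:
            k = key_of(line)
            if k in file1_keys:
                out2.write(line)
                matched.add(k)

    # Re-read the small file and emit its lines whose key occurred in file2.
    with open(file1_path) as f1, open(out1_path, "w") as out1:
        for line in f1:
            if key_of(line) in matched:
                out1.write(line)
```

With this layout memory use is bounded by the number of distinct keys in file1, and every matching file2 line is written, including lines that share a key. The cost is a second, cheap pass over file1.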
