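Finding common elements in a specific column across multiple files

Time: 2017-03-17 06:37:49

Tags: python text awk

I have multiple files, each sorted by the number of unique occurrences seen during a simulation. Example: File1 (the third column is 126 bits long):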

 12018647 
 290704 Instr1: 000000000000000000000000000000001010000111000010101001110000000000100001100101111011000000000000000000000000000000000000000001 
 276277 Instr1: 000000000000000000000000001100011110000000000111101000011000000000100000110110100101000000000000000000000000000000000000000001 
 248268 Instr1: 000000000001111111111111110100001110000000000000101000011000000000100001100101110010000000000000000000000000000000000000000001 
 230387 Instr1: 000001010111111111111111100100000000000101000100100110100000000000100001100101110011000000000000000000000000000000000000000001 
 229445 Instr1: 000000000000000000000000000000001010001011000000101000010000000000100001100101111001000000000000000000000000000000000000000001 
 224885 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
 218722 Instr1: 000000100110000000000000000100001110000000000110100100000000000000100000110110100011000000000000000000000000000000000000000001 
 216637 Instr1: 000000000000000000111100000100001010000000000000010100010000000000100001100101110101000000000000000000000000000000000000000001 
 211294 Instr1: 000000000000000000000000000000001010001111000110101011101000000000100001100101101111010000000000000000000000000000000000000001 
 201754 Instr1: 000000000000000000000000000000011010001001000000101000010000000000100001100101111010000000000000000000000000000000000000000001 
 199568 Instr1: 000001010111000110111100100100000000001001011100100110100000000000100001100101111000010000000000000000000000000000000000000001 
 192394 Instr1: 000000110111000110111100100100001010000000011100100101000000000000100001100101111111010000000000000000000000000000000000000001 
 156719 Instr1: 000001010111000110111100000100000000001011011100100110100000000000100001100101110100000000000000000000000000000000000000000001 
 154935 Instr1: 000000110111000110111011000100010110000000011100100101000000000000100001100101110001000000000000000000000000000000000000000001 
 152440 Instr1: 000000110111111111111111100100001010000000000011100101000000000000100001100101111101100000000000000000000000000000000000000001 
 150409 Instr1: 000000110111000110111100100100001110000000011100100101000000000000100001100101110111010000000000000000000000000000000000000001 
 142168 Instr1: 000000110111000110111010100100011010000000011100100101000000000000100001100101101110010000000000000000000000000000000000000001 
 127784 Instr1: 000001010110000000000000000100000000000101000110100110100000000000100000010101000110010000000000000000000000000000000000000001 
 126609 Instr1: 000000110110000000000000100100001010000000000011100101000000000000100000010101001000110000000000000000000000000000000000000001 
 107861 Instr1: 000000000000000000000000000000011010000101000000101000010000000000100000010101000101010000000000000000000000000000000000000001 
  97748 Instr1: 000000110110000000000000100101001010000000010010100101000000000000100000010101000111010000000000000000000000000000000000000001 
  96644 Instr1: 000000100110000000000000000100001010000000000110100100000000000000100000110110100100000000000000000000000000000000000000000001 
  89944 Instr1: 000000110111000110011110000100001010000000011100100101000000000000100000110111010101000000000000000000000000000000000000000001 
  84330 Instr1: 000000000000000000011111111100001010000000000010101001111000000000100001100111111100000000000000000000000000000000000000000001 
  81039 Instr1: 000000000000000000000001100100010010000000000000101000011000000000100000010101000100110000000000000000000000000000000000000001 
  77980 Instr1: 000000100110000000000000001100001010000000010001100100000000000000100000010110010000000000000000000000000000000000000000000001 
  76378 Instr1: 000000110110000000000000100101000010000000000100100101000000000000100000010111010010000000000000000000000000000000000000000001 
  68031 Instr1: 000000110111000110011110100100001110000000011100100101000000000000100000110111010010100000000000000000000000000000000000000001 
  67762 Instr1: 000000000000000000000000000000010010100001000000101000010000000000100000010111010010110000000000000000000000000000000000000001 
  66508 Instr1: 000001010110000000000000000100000000000001000100100110100000000000100000110110111110000000000000000000000000000000000000000001 
  59293 Instr1: 000000000000000000000000000000010010100001000000101000010000000000100000010101010001110000000000000000000000000000000000000001 
  57900 Instr1: 000000110110000000000000100101000010000000000100100101000000000000100000010101010001000000000000000000000000000000000000000001 
  56217 Instr1: 000000110111000000011100000100001010000000011100100101000000000000100001011001110000110000000000000000000000000000000000000001 
  56113 Instr1: 000000000000000000000011000100001010000000000010101011001000000000100001010010101101110000000000000000000000000000000000000001 

Similarly, I have File2 (the third column is 126 bits long):

3367689 
2267317 Instr1: 000000000000000000000000000000001010000101001000101000101000000000100000000100101001000000000000000000000000000000000000000001 
 395148 Instr1: 000000000000000000000000000000001010000101011110101011011000000000100000000100101000000000000000000000000000000000000000000001 
 393699 Instr1: 000000110110000000000110100100010110000000010000100101000000000000100000000100101111100000000000000000000000000000000000000001 
 283811 Instr1: 000000110110000000000000000101000010000000000101100101000000000000100000000100100111000000000000000000000000000000000000000001 
   4961 Instr1: 000001010111111111111110100100000000010101000101100110100000000000100000000011111000010000000000000000000000000000000000000001 
   3350 Instr1: 000001010111111111111111000100000000000101000011100110100000000000100000000011110111010000000000000000000000000000000000000001 
   1975 Instr1: 000000110111111111111100000100001010000000000101100101000000000000100000000011110100010000000000000000000000000000000000000001 
   1928 Instr1: 000000110111111111111110000100001010000000000101100101000000000000100000000011110110010000000000000000000000000000000000000001 
   1833 Instr1: 000000110111111111111100100100001010000000000101100101000000000000100000000011110101010000000000000000000000000000000000000001 
   1725 Instr1: 000000000000000000000011111100001010000000001000101010111000000000100000000011110010010000000000000000000000000000000000000001 
   1575 Instr1: 000000000000000000000000000000010110001001000010101000010000000000100000000011110011010000000000000000000000000000000000000001 
   1487 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
    584 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100110000000000000000000000000000000000000000001 
    495 Instr1: 000000000000000000000000001101011110000000010111101000011000000000100000000100101100110000000000000000000000000000000000000001 
    481 Instr1: 000000000000000000000001000101110110000000011101101000011000000000100000000011111001000000000000000000000000000000000000000001 
    452 Instr1: 000001010110000000000010000100000000010001011101100110100000000000100000000100101100000000000000000000000000000000000000000001 
    376 Instr1: 000000110110000000001000000100100010000000011101100101000000000000100000000100101010000000000000000000000000000000000000000001 
    342 Instr1: 000000000000000000000000000000010110101111000000101000010000000000100000000100101011000000000000000000000000000000000000000001 
    339 Instr1: 000001010110000000000010100100000000010101000010100110100000000000100000000011110001000000000000000000000000000000000000000001 
    339 Instr1: 000000000001111111111111000101110110000000011101101000011000000000100000000011101111000000000000000000000000000000000000000001 
    339 Instr1: 000000000000000000000000101100001010000000001001101010101000000000100000000011110011000000000000000000000000000000000000000001 
    339 Instr1: 000000000000000000000000101100001010000000000101101010101000000000100000000011110000000000000000000000000000000000000000000001 
    339 Instr1: 000000000000000000000000001100110010000000000000101000011000000000100000000011110010000000000000000000000000000000000000000001 
    325 Instr1: 000000110110000000000101100100001010000000010000100101000000000000100000000100101000100000000000000000000000000000000000000001 
    325 Instr1: 000000000000000000000000000000001110010001000010101000010000000000100000000100101001100000000000000000000000000000000000000001 
    257 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100111000000000000000000000000000000000000000001 
    120 Instr1: 000001010111111111111110000100000000010101000101100110100000000000100000000011111000000000000000000000000000000000000000000001 
    120 Instr1: 000001010111111111111110000100000000000101000011100110100000000000100000000011110110000000000000000000000000000000000000000001 
    120 Instr1: 000001010111111111111100000100000000000101000011100110100000000000100000000011110101000000000000000000000000000000000000000001 
    120 Instr1: 000000000000000000000000000000100010010011000000101000010000000000100000000011110111000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100101000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100100000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100011000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100010000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100001000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011111000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011110000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011101000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011100000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011011000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011010000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011001000000000000000000000000000000000000000001 
     84 Instr1: 000000000000000000000000000000000000000000000000000000000000000000000000000000011000000000000000000000000000000000000000000001 

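The files do not necessarily have the same number of lines (rows). Now I want to compare the two files and find out whether they share any third-column values, together with the corresponding column-1 number from each file:

Example output (values picked arbitrarily for illustration):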
FileA   FileB   Data
290704  283811  000000000001111111111111110100001110000000000000101000011000000000100001100101110010000000000000000000000000000000000000000001

I generated these files with the following command:

sort result.txt | uniq -c | sort -nr > File1.txt

Now I am not sure how to find the common entries. Unix "comm" did not work for me. I think I may need to use "awk" or Python, but any suggestions are welcome.

PS: This is not a hard problem.

2 answers:

Answer 0: (score: 2)

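In awk. This is an awk classic, reason enough to learn the language, and a gateway to better shell scripting:

$ awk 'NR==FNR{a[$3]=$1;next}$3 in a{print $1, a[$3], $3}' f1 f2
1487 224885 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Explained:

$ awk '
NR==FNR{                 # process the first file (the smaller one)
    a[$3]=$1             # store the count in a, keyed by the bit string in $3
    next                 # skip to the next record
}
$3 in a{                 # second file: this bit string was also seen in the first
    print $1, a[$3], $3  # output: count from f2, count from f1, bit string
}
' f1 f2                  # smaller file first, as it is hashed into memory

Edit: If you have multiple files and possibly multiple hits, the same idea extends. It needs more test data (one record unique to each of three files, one record shared by two files, and one shared by all three) and code that is not a classic, no; it is just a tribute.

Note that all your data will be stored in memory, so you need enough of it.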
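
Since the question also allows Python, the multi-file case can be sketched there as well. This is a minimal sketch and not the answerer's code: the names `load_counts` and `common_entries` and the `min_files` parameter are assumptions introduced here, and it assumes every data line has the three whitespace-separated fields shown in the question (count, label, bit string). Like the awk version, it keeps everything in memory.

```python
from collections import defaultdict


def load_counts(path):
    """Map each bit string (column 3) to its count (column 1) for one file."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:      # skip lines that have no third column
                continue
            counts[fields[2]] = int(fields[0])
    return counts


def common_entries(paths, min_files=2):
    """Return {bit_string: {path: count}} for strings found in >= min_files files."""
    seen = defaultdict(dict)
    for path in paths:
        for bits, count in load_counts(path).items():
            seen[bits][path] = count
    return {bits: hits for bits, hits in seen.items() if len(hits) >= min_files}
```

Calling `common_entries(["File1.txt", "File2.txt"])` would give, for each shared bit string, the count from every file that contains it; passing `min_files=len(paths)` restricts the result to strings present in all files.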

Answer 1: (score: 0)

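I would use an SQLite database for this. It is very easy to learn, and once you have the basics it will handle many problems that are awkward with other approaches.

Just download SQLite Browser.

Take an online course on Coursera or Udacity.

For your problem, assuming each file has been imported as a table (file1, file2) with columns column1 to column3, it can be as simple as:
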
CREATE TABLE newtable AS
SELECT file1.column1 AS count1, file2.column1 AS count2, file1.column3 AS data
  FROM file1
  JOIN file2
  ON file1.column3 = file2.column3
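
The join can be tried end to end with Python's standard `sqlite3` module, without installing anything. This is a hedged sketch rather than the answerer's code: the table names `file1` and `file2`, the column names `n` and `data`, and the helpers `load_table` and `shared_rows` are assumptions made for illustration, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3


def load_table(conn, table, path):
    """Load (count, bit string) pairs from one `uniq -c` style file into a table."""
    # Table names cannot be bound as parameters; `table` is assumed to be trusted.
    conn.execute(f"CREATE TABLE {table} (n INTEGER, data TEXT)")
    with open(path) as handle:
        rows = [
            (int(fields[0]), fields[2])
            for fields in (line.split() for line in handle)
            if len(fields) >= 3          # skip lines that have no third column
        ]
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


def shared_rows(path1, path2):
    """Return (count in file 1, count in file 2, bit string) for shared strings."""
    conn = sqlite3.connect(":memory:")
    try:
        load_table(conn, "file1", path1)
        load_table(conn, "file2", path2)
        query = """
            SELECT file1.n, file2.n, file1.data
              FROM file1
              JOIN file2 ON file1.data = file2.data
             ORDER BY file1.n DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

`shared_rows("File1.txt", "File2.txt")` would return one tuple per bit string that occurs in both files, in the FileA, FileB, Data order asked for in the question.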