Script to find the difference of column sums between two files

Time: 2014-12-25 20:23:36

Tags: linux bash unix scripting

I have two files of records

   rec10|rec11|rec12|....|abcd1234|rec19|rec110|rec111|name1|xyz|.......|rec1n
   rec20|rec21|rec22|....|abcd1234|rec29|rec210|rec211|name1|xyz|.......|rec2n
   rec30|rec31|rec32|....|xyzw1234|rec39|rec310|rec311|name1|uvw|.......|rec3n
   ...........................................................................
   ...........................................................................

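Some columns are key columns, which I can cut out and put into another file (say keyFile)

  cat recordFile | cut -d"|" -f1,5,7 > keyFile

Now, for each key K in keyFile, I have to filter the rows that have K as their key and compute the column-wise sum

I need to do the same for recordFile2

I want the difference of the column-wise sums, key by key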

Let's say file 1 is

x,y,z,5,6,7
a,y,z,3,5,8
a,x,t,1,1,1

and file 2 is

x,y,s,1,2,3
p,y,z,3,5,8
a,y,z,1,1,1

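Let's say columns 2 and 3 are the key columns. If I cut those columns, the distinct keys are (y,z) (x,t) (y,s). For each key, I need to find the difference of the column-wise sums.

Take (y,z): summing the matching rows of file 1 gives 8,11,15; similarly, file 2 gives 4,6,9. The difference is 4,5,6, so the output is (y,z) 4 5 6

The same applies to the other keys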
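The worked example above can be checked with a short Python sketch. The inline sample rows are copied from the two files shown earlier; the helper name `sum_by_key` and the zero-based column indices are assumptions made for illustration, not part of the original scripts.

```python
from collections import defaultdict

FILE1 = ["x,y,z,5,6,7", "a,y,z,3,5,8", "a,x,t,1,1,1"]
FILE2 = ["x,y,s,1,2,3", "p,y,z,3,5,8", "a,y,z,1,1,1"]

def sum_by_key(lines, key_cols, num_cols):
    """Return {key tuple: list of column sums} for the given rows."""
    sums = defaultdict(lambda: [0] * len(num_cols))
    for line in lines:
        row = line.strip().split(",")
        key = tuple(row[i] for i in key_cols)
        for j, col in enumerate(num_cols):
            sums[key][j] += int(row[col])
    return dict(sums)

# Key columns 2 and 3 of the question are indices 1 and 2 here (zero-based).
sums1 = sum_by_key(FILE1, key_cols=(1, 2), num_cols=(3, 4, 5))
sums2 = sum_by_key(FILE2, key_cols=(1, 2), num_cols=(3, 4, 5))

key = ("y", "z")
diff = [a - b for a, b in zip(sums1[key], sums2[key])]
print(key, diff)
```

Grouping every row by its key in a single pass avoids rescanning the record file once per key.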

while read -r line    # read one key at a time from inKeyFile
do
       IFS=', ' read -r -a array <<< "$line"
       # the elements of array are the key values that a row must match

       # Filter the rows that match the whole key array.
       # **How do I put the filter condition in awk for the complete key value in the array?**
       IFS=' ' read -r -a arrayA <<< "$(awk -F"|" -v k="$num1" -v n="$num2" '$col1==array[0] && $col2==array[1] && so on.. {for(i=k;i<=n;i++)s[i]+=$i} END{for(i=k;i<=n;i++)printf " %f ",s[i]}' recordFile1)"

       # arrayA now holds the awk output, num2-num1+1 values
       # same for recordFile2, read into arrayB
       IFS=' ' read -r -a arrayB <<< "$(awk .....)"

       echo "$line"    # the key
       for ((i = 0; i <= num2 - num1; i++))
       do
              echo "${arrayA[i]} - ${arrayB[i]}" | bc
       done
done < inKeyFile

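How do I put the filter into awk? Say I run it on file 2 as ./Myscript.sh 2:x,3:y,5:z 10 15 to get the column-wise sums of columns 10 through 15 over the rows whose key columns have the specified values. Columns 2, 3 and 5 are the key columns (I can cut them and put them into inKeyFile); column 2 should be x, column 3 should be y, and column 5 should be z. How do I apply this filter in awk?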
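One way to picture the filter is the Python sketch below. It assumes the filter is written as comma-separated `column:value` pairs with 1-based columns and that records are `|`-delimited; `parse_filter`, `filtered_sums` and the sample rows are hypothetical names and data introduced only for illustration. In awk itself, the same condition is written with `==` comparisons joined by `&&`, for example `$2=="x" && $3=="y" && $5=="z"`.

```python
def parse_filter(spec):
    """Turn '2:x,3:y,5:z' into {2: 'x', 3: 'y', 5: 'z'} (1-based columns)."""
    wanted = {}
    for part in spec.split(","):
        col, value = part.split(":", 1)
        wanted[int(col)] = value
    return wanted

def filtered_sums(lines, spec, first, last, sep="|"):
    """Sum columns first..last (1-based, inclusive) over rows matching spec."""
    wanted = parse_filter(spec)
    totals = [0.0] * (last - first + 1)
    for line in lines:
        row = line.rstrip("\n").split(sep)
        # A row qualifies only if every key column holds the requested value.
        if all(row[col - 1] == value for col, value in wanted.items()):
            for j in range(first, last + 1):
                totals[j - first] += float(row[j - 1])
    return totals

# Made-up sample rows: the first two match the filter, the third does not.
rows = ["a|x|y|1|z|10|20", "b|x|y|2|z|1|2", "c|x|q|3|z|100|200"]
print(filtered_sums(rows, "2:x,3:y,5:z", 6, 7))
```

Keeping the filter as a mapping from column to value means the number of key columns does not have to be fixed in advance.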

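How do I avoid processing keys in inKeyFile whose difference has already been printed (something like a Set in Java)? Edit: I think I can sort inKeyFile, and if the last key read is the same as the current key, I can skip it.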
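The Set idea can be sketched in Python as follows. The function name `unique_keys` and the sample key lines are assumptions for illustration. In the shell, `sort -u inKeyFile` gives a similar result, at the cost of reordering the keys.

```python
def unique_keys(lines):
    """Yield each non-empty key line once, in first-seen order."""
    seen = set()
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            yield key

# Made-up key lines with duplicates, as produced by cutting the key columns.
keys = ["y,z", "y,z", "x,t", "y,z"]
print(list(unique_keys(keys)))
```

Unlike the sort-and-compare approach, the set does not require the key file to be sorted first.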

1 Answer:

Answer 0 (score: 1):

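To find the difference file1 - file2, computed as the difference of the column-wise sums of the rows grouped by the selected columns, e.g. columns 1 and 2 (zero-based):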

$ ./columnwise-sum-diff 1,2 file1 file2
{"y|z": [4, 5, 6]}

where columnwise-sum-diff is:

#!/usr/bin/env python
import json
import sys
from operator import itemgetter

def columnwise_sum(a, b):
    return tuple(x+y for x, y in zip(a, b)) # map(sum, zip(*args))

def columnwise_diff(a, b):
    return tuple(y-x for x, y in zip(a, b)) # b - a

def sum_file(filename, get_key, get_numbers):
    filesum = {}
    with open(filename) as file:
        for line in file:
            row = line.split(',')
            key = get_key(row)
            numbers = get_numbers(row)
            total = filesum.get(key)
            filesum[key] = columnwise_sum(total, numbers) if total else numbers
    return filesum

if len(sys.argv) != 4:
    sys.exit('Usage: columnwise-sum-diff <keycol1,keycol2> <file1> <file2>')

key_columns = sorted(map(int, sys.argv[1].split(',')))
get_key = itemgetter(*key_columns)
n = max(key_columns) + 1 # to the right of the key columns

def get_numbers(row, getcols=itemgetter(*range(n, n + 3))):
    return tuple(map(int, getcols(row)))

file1sum = sum_file(sys.argv[2], get_key, get_numbers)
file2sum = sum_file(sys.argv[3], get_key, get_numbers)
diff = {'|'.join(k): columnwise_diff(file2sum[k], file1sum[k])
        for k in set(file1sum) & set(file2sum)} # keys present in both files
json.dump(diff, sys.stdout)

It produces JSON to simplify structured data exchange.
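As a sketch of how that JSON might be consumed, the snippet below turns it back into the `(y,z) 4 5 6` layout asked for in the question. The function name `format_diffs` is hypothetical, and the sample string is the output of the run shown above.

```python
import json

def format_diffs(raw_json):
    """Render {"y|z": [4, 5, 6]} style JSON as lines such as '(y,z) 4 5 6'."""
    lines = []
    for key, values in sorted(json.loads(raw_json).items()):
        label = key.replace("|", ",")
        lines.append("(%s) %s" % (label, " ".join(str(v) for v in values)))
    return lines

raw = '{"y|z": [4, 5, 6]}'  # output of the sample run
for line in format_diffs(raw):
    print(line)
```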