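Following my first question here, I would like to extend the condition: find the closest values between two different files using both the first and second columns, and print specific columns.
File1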
1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1
File2
1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j
So, for example, for each $1 in File1 I need to find the closest $1 in File2, and then, among those rows, also the closest $2.
Output:
1 2 a1*
1 2 b*
1 4 b1
1 4 d
8 5 c1
9 5 g
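*The first starred line comes from File1 and the second from File2: for the first column (File1), the closest value (from column 1 of File2) is 1, and the second condition is that the second column must also be the closest value, which in this case is 2. I print $1, $2, $5 from File1 and $1, $2, $4 from File2.
The other output lines follow the same process.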
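To make the two-step rule concrete, here is a minimal sketch for a single target pair, the third row of File1 ($1 = 8, $2 = 5). The temporary file, the variables `t1` and `t2`, and the helper `abs` are names assumed for this illustration only; it is not the solution from the earlier post.

```shell
# Hypothetical temp file holding the File2 sample from the question
tmp=$(mktemp -d)
cat > "$tmp/f2" <<'EOF'
1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j
EOF

# t1 and t2 are the target $1 and $2 (assumed names)
result=$(awk -v t1=8 -v t2=5 '
function abs(v) { return v < 0 ? -v : v }
{ c1[NR] = $1; c2[NR] = $2; tag[NR] = $4 }
END {
    # Step 1: closest value in column 1
    best1 = c1[1]
    for (i = 2; i <= NR; i++)
        if (abs(c1[i] - t1) < abs(best1 - t1)) best1 = c1[i]
    # Step 2: among rows with that column 1, closest value in column 2
    best = 0
    for (i = 1; i <= NR; i++)
        if (c1[i] == best1 && (best == 0 || abs(c2[i] - t2) < abs(c2[best] - t2)))
            best = i
    print c1[best], c2[best], tag[best]
}' "$tmp/f2")
echo "$result"
rm -rf "$tmp"
```

For this target, step 1 selects 9 as the closest first column, and step 2 selects 5 among the rows starting with 9, which gives the line `9 5 g` shown in the expected output.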
The solution for finding the closest value is in my other post, provided by @Tensibai, but any solution will do. Thanks!
Answer 0 (score: 1)
It looks a bit complicated, but it works:
# The extra parameters after "searched" are only there to make those variables local
function closest(array, searched,    x, tmp, distance, found1, found2, skeys, mkeys) {
    distance = 999999 # must be larger than any possible difference between keys, otherwise nothing is found
    split(searched, skeys, OFS)
    # Get the first part of the key
    for (x in array) { # loop over the array to get its keys
        split(x, mkeys, OFS) # split the array key
        # +0 forces a numeric comparison; tmp is the absolute difference between the key and the target
        tmp = (mkeys[1]+0 > skeys[1]+0) ? mkeys[1] - skeys[1] : skeys[1] - mkeys[1]
        if (tmp < distance) { # if the distance is less than the previous best, update
            distance = tmp
            found1 = mkeys[1] # and save the closest key found so far
        }
    }
    # At this point we have the first part of the key, let's redo the work for the second part
    distance = 999999
    for (x in array) {
        split(x, mkeys, OFS)
        if (mkeys[1] == found1) { # filter on the first part of the key
            # same absolute difference as above, on the second part of the key
            tmp = (mkeys[2]+0 > skeys[2]+0) ? mkeys[2] - skeys[2] : skeys[2] - mkeys[2]
            if (tmp < distance) { # if the distance is less than the previous best, update
                distance = tmp
                found2 = mkeys[2] # and save the closest key found so far
            }
        }
    }
    # Now we got the second field, woot
    return (found1 OFS found2) # return the combined key from our two searches
}
{
    if (NR > FNR) { # past the first file (FNR, the per-file record number, is less than NR, the total record number)
        b[($1 OFS $2)] = $4 # make an array with "$1 $2" as key and $4 as value
    } else {
        key = ($1 OFS $2) # build the key once to avoid recomputing it later
        akeys[max++] = key # store the keys in input order, as for (x in array) does not guarantee any order
        a[key] = $5 # make an array with the key stored previously and $5 as value
    }
}
END { # both files are parsed, print the result
    for (i = 0; i < max; i++) { # loop by numeric index; for (i in akeys) would not guarantee the order
        print akeys[i], a[akeys[i]] # print the line from the first file (key, then value)
        if (akeys[i] in b) { # if the same key exists in the second file
            print akeys[i], b[akeys[i]] # then print it
        } else {
            bindex = closest(b, akeys[i]) # call the function to find the closest key from the second file
            print bindex, b[bindex] # print what we found
        }
    }
}
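Note that I use OFS to combine the fields, so if you change it for the output, the script will still behave correctly.
Warning: this should be fine for relatively short files, but the array built from the second file is now traversed twice, so each search takes twice as long.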
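As a hedged sketch of how the double traversal could be avoided, the two distances can be compared as a pair in a single loop. The function name `closest1pass` and the helper `abs` are assumptions made for this illustration. Note that this is not strictly equivalent to the two-pass version: when two different first-column values are equally close to the target, all of their rows compete on the second column, instead of only the rows of one of them.

```shell
# Hypothetical temp files holding the sample data from the question
tmp=$(mktemp -d)
printf '%s\n' '1 2 3 4 a1' '1 4 5 6 b1' '8 5 9 11 c1' > "$tmp/f1"
printf '%s\n' '1 1 3 a' '1 2 5 b' '1 2.1 4 c' '1 4 6 d' '2 4 5 e' \
              '9 4 1 f' '9 5 2 g' '9 6 2 h' '11 10 14 i' '11 15 5 j' > "$tmp/f2"

out=$(awk '
function abs(v) { return v < 0 ? -v : v }
# closest1pass (assumed name): key of array nearest to searched, in one traversal
function closest1pass(array, searched,    x, s, m, d1, d2, best1, best2, found) {
    split(searched, s, OFS)
    best1 = best2 = -1                       # -1 means nothing found yet
    for (x in array) {
        split(x, m, OFS)
        d1 = abs(m[1] - s[1])
        d2 = abs(m[2] - s[2])
        # smaller first distance wins; the second distance only breaks ties
        if (best1 < 0 || d1 < best1 || (d1 == best1 && d2 < best2)) {
            best1 = d1; best2 = d2; found = x
        }
    }
    return found
}
NR > FNR { b[$1 OFS $2] = $4; next }          # second file
{ keys[n++] = $1 OFS $2; a[$1 OFS $2] = $5 }  # first file, keys kept in input order
END {
    for (i = 0; i < n; i++) {
        k = closest1pass(b, keys[i])          # an exact match has distance 0, so it is found too
        print keys[i], a[keys[i]]
        print k, b[k]
    }
}' "$tmp/f1" "$tmp/f2")
echo "$out"
rm -rf "$tmp"
```

On the sample files this prints the same six lines as the two-pass script, while reading the second array once per search.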
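If your files were sorted, a better search algorithm could be chosen (but that was not the case in the previous question, and you wanted to keep the order of the file). In that case, a first improvement would be to break out of the for loop as soon as the distance starts to be greater than the previous one.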
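The early exit could look like the following minimal sketch, which assumes a single column that is already sorted in ascending order and stored in a numerically indexed array (a `for (x in array)` loop has no defined order, so it cannot be used here). The variable `target` and the helper `abs` are names assumed for the illustration; on a tie between a smaller and a larger value, this version keeps the larger one.

```shell
# Column 1 of File2, already sorted ascending; target is an assumed variable name
result=$(printf '%s\n' 1 1 1 1 2 9 9 9 11 11 | awk -v target=8 '
function abs(v) { return v < 0 ? -v : v }
{ v[NR] = $1 }                  # keep the values in input (sorted) order
END {
    best = v[1]
    for (i = 2; i <= NR; i++) {
        if (abs(v[i] - target) > abs(best - target))
            break               # the distance started to grow: stop scanning
        best = v[i]             # same or smaller distance: keep going
    }
    print best
}')
echo "$result"
```

With sorted input the distance to the target first shrinks and then grows, so the scan can stop at the first increase (here at 11) instead of visiting every remaining row.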
Output for the sample files:
$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g