Question

I need to compare two files (new.txt and old.txt) with the following structure:

 <Field1>,<Field2>,<Field3>,<Field4>,<Field5>,<Field6>

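1. Common lines must be skipped.
2. Similar lines from new.txt and old.txt should be grouped. I consider a line from old.txt similar to a line in new.txt if Field1, Field2, Field3 and Field4 are the same.
3. All other unique lines should be printed, grouped by filename.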

So the final goal is to make visual comparison easier.

Added part: an example.

$ cat old.txt 
 one,two,three,four,five,six
 un,deux,trois,quatre,cinq,six
 eins, zwei, drei, vier, fünf, sechs
$ cat new.txt 
 one,two,three,four,FIVE,SIX
 un,deux,trois,quatre,cinq,six
 en,två,tre,fyra,fem,sex

$cat comparison_result:
# lines are grouped. So it is easy to find the difference without scrolling.
old.txt> one,two,three,four,five,six
new.txt> one,two,three,four,FIVE,SIX
# end of task 2. There are no more similar lines.
#
#start task 3.
#Printing all the rest unique lines of old.txt 
echo "the rest unique line in old.txt"
eins, zwei, drei, vier, fünf, sechs
.... 
#Printing all the rest unique lines of new.txt
echo "the rest unique line in new.txt"
en,två,tre,fyra,fem,sex

This can be step 1: skipping the common lines.

 # This is only in old.txt
 comm -2 -3 <(sort old.txt) <(sort new.txt) > uniq_old

 # This is only in new.txt
 comm -1 -3 <(sort old.txt) <(sort new.txt) > uniq_new

I wrote step 1 and added this sorted diff as a temporary solution:

 # additional sort improves a bit diffs results.
 diff <(sort uniq_old) <(sort uniq_new)

It works, but it is not ideal. I rejected plain diff because it starts comparing chunks and misses common lines.

Is there a better way to meet the three requirements above?

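I think it could be done with:

some improvements to these sort, diff and comm commands (adding sed / tr to temporarily "hide" the last two fields and compare the rest)
awk

I suspect awk could do it better?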

Answer 1

How about this?

awk -F, 'NR==FNR{old[$0];next} $0 in old{delete old[$0];next} 1; END{for(line in old) print line}' old.txt <(sort -u new.txt) | sort

Let's break it down into parts.

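-F, tells awk to use , as the field separator.
NR==FNR{old[$0];next} - if NR (the overall record/line number) equals the line number within the current file (that is, while we are reading the first input file), store the entire line as an index of an associative array, then skip to the next record.
$0 in old{delete old[$0];next} - now we are reading the second file. If the current line is in the array, delete it from the array and move on. This addresses condition #1 in your question.
1 - shorthand for "print the line". This addresses part of condition #3 in your question by printing the lines unique to the second file.
END{...} - this loop prints everything that was not deleted from the array. This addresses the other part of condition #3 by printing the lines unique to the first file.
<(sort -u new.txt) - uniquifies the input from new.txt. If you know that new.txt is already unique, you can remove this bash dependency.
| sort - sorts the output, which "groups" things per condition #2 in your question.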

Sample output:

 $ cat old.txt 
 one,two,three,four,five,six
 un,deux,trois,quatre,cinq,six
 $ cat new.txt 
 one,two,three,four,FIVE,SIX
 un,deux,trois,quatre,cinq,six
 en,två,tre,fyra,fem,sex
 $ awk -F, 'NR==FNR{old[$0];next} $0 in old{delete old[$0];next} 1; END{for(line in old) print line}' old.txt new.txt | sort
 en,två,tre,fyra,fem,sex
 one,two,three,four,FIVE,SIX
 one,two,three,four,five,six
 $

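Note that the French line is duplicated and therefore dropped. Everything else is printed, and the two English lines are "grouped" by the sort.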
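
The final sort only puts similar lines next to each other because they happen to share a prefix. As a hedged sketch, not part of the original answer, the grouping can also be done explicitly on Field1 to Field4. The function name compare_grouped and the output labels are made up for illustration, and the sketch assumes a non-empty first file and at most one line per four-field key in each file.

```shell
# Hypothetical helper: compare_grouped OLD NEW
# Assumes OLD is not empty and each Field1-Field4 key occurs at most once per file.
compare_grouped() {
    awk -F, -v oldname="$1" -v newname="$2" '
        # Build the grouping key from the first four fields of every line.
        { key = $1 FS $2 FS $3 FS $4 }

        # First file: remember each line by key, and keep the input order.
        NR == FNR { oldline[key] = $0; order[++n] = key; next }

        # Second file, key also seen in the first file.
        (key in oldline) {
            # Identical lines are common and skipped; differing ones are paired.
            if ((oldline[key] "") != ($0 "")) {
                print oldname "> " oldline[key]
                print newname "> " $0
            }
            delete oldline[key]
            next
        }

        # Second file, key not seen before: unique to the second file.
        { newonly[++m] = $0 }

        END {
            print "# unique lines in " oldname
            for (i = 1; i <= n; i++)
                if (order[i] in oldline) print oldline[order[i]]
            print "# unique lines in " newname
            for (i = 1; i <= m; i++) print newonly[i]
        }
    ' "$1" "$2"
}
```

Running compare_grouped old.txt new.txt on the three-line sample files from the question should print:

 old.txt> one,two,three,four,five,six
 new.txt> one,two,three,four,FIVE,SIX
 # unique lines in old.txt
 eins, zwei, drei, vier, fünf, sechs
 # unique lines in new.txt
 en,två,tre,fyra,fem,sex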

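Also note that this solution suffers with very large files, because all of old.txt is loaded into memory as an array. An alternative that might work for you is: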

 $ sort old.txt new.txt | awk '$0==last{last="";next} last!=""{print last} {last=$0} END{if(last!="")print last}' | sort
 en,två,tre,fyra,fem,sex
 one,two,three,four,FIVE,SIX
 one,two,three,four,five,six
 $

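The idea here is that you simply take all the input data from the files, sort it, then use an awk script to skip duplicated lines and print everything else. The output is then sorted. As far as awk is concerned this works on a stream, but be aware that for very large input your sort command still needs to load the data into memory and/or temporary files.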

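Also, this second solution fails if a particular line is repeated more than once, for example if it exists once in old.txt and twice in new.txt. You would need to uniquify the input files, or adapt the script to handle that case.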
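
One way to cope with repeated lines, sketched here under the assumption that repeats within a single file carry no meaning, is to uniquify each file separately and let uniq -u keep only the lines that then occur once overall. The function name symmetric_diff is hypothetical.

```shell
# Hypothetical helper: symmetric_diff OLD NEW
# Prints the lines that occur in exactly one of the two files,
# ignoring how often a line repeats inside a single file.
symmetric_diff() {
    # After the per-file sort -u, a line shared by both files appears exactly
    # twice in the combined stream, so uniq -u drops it and keeps the rest.
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -u
}
```

Like the pipeline above, this still relies on sort, so very large inputs will spill into memory and/or temporary files.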
