Quickly verifying that two "cells" match in large data sets

Time: 2015-02-19 23:27:17

Tags: bash verify

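I am working with a lot of data (a few hundred lines per check) and would like to know the most efficient way to compare two different sets of data.

What I am looking for is a way to find differences like the following:

From source 1: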

site1.49729    site2.80124             /path/path/path/path               
site1.49730    site2.80125             /path/path/path/path               
site1.49734    site2.80126             /path/path/path/path               
site1.49735    site2.80127             /path/path/path/path               
site1.49736    site2.80128             /path/path/path/path               
site1.49737    site2.80129             /path/path/path/path               
site1.49738    site2.80131             /path/path/path/path               
site1.49752    site2.80171             /path/path/path/path

From source 2:

site1.49729    site2.80124             /path/path/path/path               
site1.49730    site2.80125             /path/path/path/path               
site1.49734    **site2.1234**              /path/path/path/path               
site1.49735    site2.80127             /path/path/path/path               
site1.49736    site2.80128             /path/path/path/path               
site1.49737    **site2.12345**             /path/path/path/path               
site1.49738    site2.80131             /path/path/path/path               
site1.49752    site2.80171             /path/path/path/path
**site1.49735    site2.99999               /path/path/path/path**

The differences are highlighted with **.

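What is the most efficient way to make sure that nothing in the second column of the two commands' output is missing, and that #2 matches the records exactly?

Any ideas on where to start?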

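2 Answers:

Answer 0: (score: 0)

I would suggest simply running diff on source 1 and source 2. It will show the lines that contain differences. Put the contents of source 1 in s1.txt and the contents of source 2 in s2.txt, then run: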

$ diff -y s1.txt s2.txt

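This prints the two files side by side and marks the lines that differ.

Answer 1: (score: 0)

Use the 'diff' command. For your case it produces output like this: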
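
With a few hundred lines, most of the side-by-side output will be identical rows. The sketch below assumes GNU diff, where `--suppress-common-lines` hides those rows in `-y` mode. The sample files are shortened copies of the data in the question with the `**` markers removed, and the scratch directory is only there for illustration.

```bash
#!/usr/bin/env bash
# Sketch: side-by-side diff that prints only the rows that differ.
# Assumes GNU diff (-y and --suppress-common-lines are GNU options).

workdir=$(mktemp -d)   # scratch directory so no real file is overwritten

# Shortened copies of the two sources from the question.
cat > "$workdir/s1.txt" <<'EOF'
site1.49734    site2.80126    /path/path/path/path
site1.49735    site2.80127    /path/path/path/path
site1.49737    site2.80129    /path/path/path/path
EOF

cat > "$workdir/s2.txt" <<'EOF'
site1.49734    site2.1234     /path/path/path/path
site1.49735    site2.80127    /path/path/path/path
site1.49737    site2.12345    /path/path/path/path
site1.49735    site2.99999    /path/path/path/path
EOF

# diff exits with status 1 when the files differ, so accept that status.
changed=$(diff -y --suppress-common-lines "$workdir/s1.txt" "$workdir/s2.txt") || [ "$?" -eq 1 ]
printf '%s\n' "$changed"
```

In the output, `|` marks a row that differs between the two files and `>` marks a row that exists only in s2.txt. The unchanged site2.80127 row is not printed.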

3c3
< site1.49734    site2.80126             /path/path/path/path
---
> site1.49734    **site2.1234**              /path/path/path/path
6c6
< site1.49737    site2.80129             /path/path/path/path
---
> site1.49737    **site2.12345**             /path/path/path/path
8c8,9
< site1.49752    site2.80171             /path/path/path/path
\ No newline at end of file
---
> site1.49752    site2.80171             /path/path/path/path
> **site1.49735    site2.99999               /path/path/path/path**

Many text editors also provide a GUI for diffing files or viewing differences (for example Notepad++).
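
diff compares whole lines by position. If only the second column matters, one possible approach is to extract that column from each file and compare the two sorted lists with `comm`. This is a sketch under assumptions rather than a definitive solution: it assumes the columns are whitespace-separated, that the `**` markers are not part of the real data, and it uses a hypothetical helper named `col2` plus a scratch directory for the sample files.

```bash
#!/usr/bin/env bash
# Sketch: list second-column values that appear in only one of the two sources.

export LC_ALL=C        # make sort and comm agree on the collation order
workdir=$(mktemp -d)   # scratch directory for the sample files

# Shortened copies of the two sources from the question.
cat > "$workdir/s1.txt" <<'EOF'
site1.49734    site2.80126    /path/path/path/path
site1.49735    site2.80127    /path/path/path/path
site1.49737    site2.80129    /path/path/path/path
EOF

cat > "$workdir/s2.txt" <<'EOF'
site1.49734    site2.1234     /path/path/path/path
site1.49735    site2.80127    /path/path/path/path
site1.49737    site2.12345    /path/path/path/path
site1.49735    site2.99999    /path/path/path/path
EOF

# col2 (hypothetical helper): print column 2 of a file, sorted and
# de-duplicated, because comm expects sorted input.
col2() { awk '{ print $2 }' "$1" | sort -u; }

col2 "$workdir/s1.txt" > "$workdir/c1.txt"
col2 "$workdir/s2.txt" > "$workdir/c2.txt"

# comm -23 keeps lines found only in the first file,
# comm -13 keeps lines found only in the second file.
missing_from_s2=$(comm -23 "$workdir/c1.txt" "$workdir/c2.txt")
extra_in_s2=$(comm -13 "$workdir/c1.txt" "$workdir/c2.txt")

echo "Only in source 1:"; printf '%s\n' "$missing_from_s2"
echo "Only in source 2:"; printf '%s\n' "$extra_in_s2"
```

For the sample data this reports site2.80126 and site2.80129 as present only in source 1, and site2.1234, site2.12345 and site2.99999 as present only in source 2. Because the lists are sorted, the result does not depend on row order, but it also no longer says which row a value came from.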