How do I convert a two-column CSV file to consecutive integers?

Time: 2014-08-11 17:17:31

Tags: shell csv

Hello. Say I have a file, file1.csv, with two columns, a and b, both of which hold 22-character strings. It looks like this:

hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w
...

I want to convert columns a and b to consecutive integers, like this:

1,1
1,2
2,3
2,4

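Does anyone know how I can do this? I am using Ubuntu 12.04, by the way.

Suppose I also have another file, file2.csv, with columns a' and b'. Is there a way to do the same thing for file2, so that if "hWcYwgRKOD77hfm1oKE0IA" is 1 in file1, it is also 1 in file2 wherever it appears? The same goes for columns b and b'. I would like a separate output for each of the two files: result1.csv and result2.csv.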

1 answer:

Answer 0 (score: 2):

awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x } 
                    if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y } } 1' file

Or, perhaps simpler but possibly less efficient:

awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } { $1 = a[$1] }
                  !($2 in b) { b[$2] = ++y } { $2 = b[$2] } 1' file

Or, generalized to any number of columns:

awk -F, -v OFS=, '{ for (i = 1; i <= NF; ++i)
                        if ((i, $i) in a) { $i = a[i, $i] }
                                     else { $i = a[i, $i] = ++x[i] } } 1' file

This can also be written as:

awk -F, -v OFS=, '{ for (i = 1; i <= NF; ++i) {
                    if (!((i, $i) in a)) a[i, $i] = ++x[i]
                    $i = a[i, $i] } } 1' file

Output:

1,1
1,2
2,3
2,4
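To try the commands above end to end, here is a minimal sketch that recreates the four sample rows from the question in a temporary file and runs the any-number-of-columns variant on them. The helper name `number_columns` and the temporary file are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# number_columns [FILE...]: replace each distinct string in every column with
# a per-column running integer. Hypothetical helper name, for illustration.
number_columns() {
    awk -F, -v OFS=, '{
        for (i = 1; i <= NF; ++i) {
            # a[i, value] holds the ID already given to value in column i;
            # x[i] is the last ID handed out for column i.
            if (!((i, $i) in a)) a[i, $i] = ++x[i]
            $i = a[i, $i]
        }
    } 1' "$@"
}

# Recreate the sample rows from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w
EOF

number_columns "$sample"
rm -f "$sample"
```

Because the array key includes the column index `i`, each column keeps its own counter, so a string that happens to appear in both columns is numbered independently in each.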

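Update

To apply this to two files with one shared mapping (so a string numbered 1 in file1 is also 1 in file2), pass both files to a single awk run: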

awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x }
                    if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y }
                    print > ("result_" FILENAME) }' file1 file2
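The question asks for outputs named result1.csv and result2.csv, whereas the command above names each output after its input file. One possible sketch, assuming the outputs should simply be numbered in the order the inputs are given, bumps a counter whenever a new input file starts (`FNR == 1`) and builds the output name from it. The helper name `number_to_results` and the variables `n` and `out` are illustrative, not from the original answer. An empty input file would not receive a number.

```shell
#!/bin/sh
# number_to_results FILE...: write result1.csv, result2.csv, ... in the
# current directory, one per input file, using one shared string-to-ID
# mapping per column. Hypothetical helper name, for illustration.
number_to_results() {
    awk -F, -v OFS=, '
        # FNR resets to 1 on the first line of every input file.
        FNR == 1 { if (out != "") close(out); out = "result" (++n) ".csv" }
        {
            if (!($1 in a)) a[$1] = ++x   # first column IDs
            if (!($2 in b)) b[$2] = ++y   # second column IDs
            print a[$1], b[$2] > out
        }' "$@"
}
```

It would be called as `number_to_results file1.csv file2.csv`. The arrays `a` and `b` are never reset between files, which is what keeps the numbering consistent across both outputs.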

Update 02

To keep the original strings next to their integer IDs:

awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } !($2 in b) { b[$2] = ++y }
                  { print $1, $2, a[$1], b[$2] }' file

Output:

hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug,1,1
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA,1,2
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw,2,3
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w,2,4

Per-file version:

awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } !($2 in b) { b[$2] = ++y }
                  { print $1, $2, a[$1], b[$2] > ("result_" FILENAME) }' file1 file2