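Unix: Filter CSV columns based on an external file

Date: 2017-11-21 19:48:24

Tags: unix awk

I am working with a large CSV file (millions of rows and 80,000 columns). I want to extract all rows, but only the columns listed in an external text file, and save the result to a new file. For example: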

Source data file

id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10
sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB
sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB
sampl3,AA,BB,AB,BB,BB,AA,AA,BB,BB,BB
sampl4,AA,BB,AA,BB,AB,AA,BB,BB,BB,BB
sampl5,AA,BB,AB,BB,AB,AA,AA,BB,BB,BB
sampl6,AA,BB,AB,BB,BB,AA,AB,BB,BB,BB
sampl7,AA,BB,BB,AB,AB,AA,AB,BB,BB,BB

External file containing the list of columns to keep:

snp3
snp6
snp7
snp10

Result (new) file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB

Is there an efficient way to do this with awk?

3 answers:

Answer 0: (score: 2)

A non-awk solution

$ cut -d, -f1,$(grep -Ff columns <(sed 1q file | tr ',' '\n' | nl -w1) | cut -f1 | paste -sd,) file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
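
The one-liner is easier to follow when split into its stages. The sketch below rebuilds a two-sample version of the example in a temporary directory and runs each stage on its own. The variable names (numbered, wanted, fields, result) are made up for illustration, a plain pipe stands in for the process substitution, and grep's -w flag is an addition here so that a listed name such as snp1 could not also match inside snp10.

```shell
# Recreate a small version of the example ("file" and "columns" as in the answer).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10' \
              'sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB' \
              'sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB' > file
printf '%s\n' snp3 snp6 snp7 snp10 > columns

# Stage 1: take the header row, put one name per line, and number the lines,
# giving "1<TAB>id", "2<TAB>snp1", and so on.
numbered=$(sed 1q file | tr ',' '\n' | nl -w1)

# Stage 2: keep the numbered lines whose name appears in the list.
# -F: fixed strings, -f: read patterns from a file, -w: whole words only (added here).
wanted=$(printf '%s\n' "$numbered" | grep -wFf columns)

# Stage 3: keep only the line numbers and join them with commas.
fields=$(printf '%s\n' "$wanted" | cut -f1 | paste -sd, -)

# Stage 4: cut field 1 (id) plus the selected fields from every row.
result=$(cut -d, -f"1,$fields" file)
printf '%s\n' "$result"

# Remove the scratch directory.
cd / && rm -rf "$workdir"
```

For this sample, fields comes out as 4,7,8,11, so the last stage is equivalent to cut -d, -f1,4,7,8,11 file.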

awk to the rescue!

$ awk 'NR==FNR {cols[$1]; next}
       FNR==1  {for(i=2;i<=NF;i++) if($i in cols) colin[i]}
               {line=$1;
                for(i=1;i<=NF;i++) if(i in colin) line=line FS $i; 
                print line}' columns FS=, file

id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
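
The awk script relies on two idioms that are easy to miss: NR==FNR is true only while the first file is being read, and the FS=, placed between the two file names takes effect only when awk reaches the second file. The following sketch makes both visible by labelling every record; the tiny input files and the labels are made up for illustration.

```shell
# Tiny stand-ins for the answer's "columns" and "file".
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' snp3 snp6 > columns
printf '%s\n' 'id,snp1,snp3' 'sampl1,AA,AB' > file

# NR counts records across all inputs, FNR restarts at 1 for each file,
# so NR==FNR holds only for the first file. FS=, is applied before "file" is opened.
out=$(awk 'NR==FNR {print "list (NF=" NF "): " $0; next}
           FNR==1  {print "header (NF=" NF "): " $0; next}
                   {print "data (NF=" NF "): " $0}' columns FS=, file)
printf '%s\n' "$out"

# Remove the scratch directory.
cd / && rm -rf "$workdir"
```

The list lines report NF=1 because they are split with the default separator, while the header and data lines report NF=3 once the comma separator is in effect. Note that the idiom assumes the first file is not empty; with an empty list, NR==FNR would stay true for the CSV itself.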

Answer 1: (score: 1)

I suggest using csvkit. It is built for this kind of job and works correctly even when some of the data are strings in double quotes.

Install:

sudo apt install python3-csvkit

Usage:

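The -c option takes the names of the columns, and tr is used to replace each '\n' with ','. Because we do not want the argument to end with ',', we use sed to remove it.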
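
A minimal sketch of a csvcut call matching that description, assuming the list is in a file named columns and the data in a file named file (names borrowed from the first answer). Prepending id to the list is also an assumption, made so that the output keeps the sample identifier as in the expected result.

```shell
# Recreate a small version of the example in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10' \
              'sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB' > file
printf '%s\n' snp3 snp6 snp7 snp10 > columns

# tr turns every newline into a comma, which leaves one trailing comma;
# sed strips that last comma so the list is a valid -c argument.
cols=$(tr '\n' ',' < columns | sed 's/,$//')
printf '%s\n' "$cols"

# csvcut ships with csvkit; the call is skipped when it is not installed.
if command -v csvcut >/dev/null 2>&1; then
    csvcut -c "id,$cols" file
fi

# Remove the scratch directory.
cd / && rm -rf "$workdir"
```

Because the whole list travels as a single command-line argument, a list with tens of thousands of names may run into the operating system's limit on argument length; the awk answers read the list from a file and do not have that constraint.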

Answer 2: (score: 0)

$ cat tst.awk
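BEGIN { FS=OFS="," }
# First file (the list): remember every column name to keep, plus "id".
NR==FNR {
    list["id"]
    list[$0]
    next
}
# Header row of the CSV: record the positions of the wanted columns.
FNR==1 {
    for (i=1; i<=NF; i++) {
        if ($i in list) {
            f[++nf] = i
        }
    }
}
# Every row, header included: print only the recorded fields.
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}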

$ awk -f tst.awk list file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
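
The script keeps a column only when its header name is in the list, so a listed name that is misspelled or absent from the header is dropped without any message. With 80,000 columns that is easy to overlook, so a quick pre-check may be worthwhile. The sketch below is one possible way to do it, not part of the answer; the sample data and the name snp99 are made up, and the arrays want and seen are illustrative names.

```shell
# Tiny sample: the list asks for snp99, which the header does not contain.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,snp1,snp2,snp3' 'sampl1,AA,BB,AB' > file
printf '%s\n' snp3 snp99 > list

missing=$(awk '
BEGIN { FS = "," }
NR == FNR { want[$0]; next }               # first file: requested names
FNR == 1 {
    for (i = 1; i <= NF; i++) seen[$i]     # names present in the header
    for (name in want)
        if (!(name in seen)) print name    # requested but not found
    exit                                   # only the header is needed
}' list file)
printf 'missing: %s\n' "$missing"

# Remove the scratch directory.
cd / && rm -rf "$workdir"
```

The exit after the header row means the check reads a single line of the CSV, so it costs almost nothing even on a file with millions of rows.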