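I am working with a large CSV file (millions of rows and 80,000 columns). I want to write all rows to a new file, but keep only the columns listed in an external text file. For example, given this file: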
id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10
sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB
sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB
sampl3,AA,BB,AB,BB,BB,AA,AA,BB,BB,BB
sampl4,AA,BB,AA,BB,AB,AA,BB,BB,BB,BB
sampl5,AA,BB,AB,BB,AB,AA,AA,BB,BB,BB
sampl6,AA,BB,AB,BB,BB,AA,AB,BB,BB,BB
sampl7,AA,BB,BB,AB,AB,AA,AB,BB,BB,BB
and this list of columns:
snp3
snp6
snp7
snp10
the desired output is:
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
Is there an efficient way to do this with awk?
Answer 0 (score: 2)
A non-awk solution:
$ cut -d, -f1,$(grep -wFf columns <(sed 1q file | tr ',' '\n' | nl -w1) | cut -f1 | paste -sd,) file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
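The one-liner above packs four steps into a single command, so here is a sketch that runs them one at a time on a shortened copy of the question's sample. The temporary directory and the variable names (`numbered`, `matched`, `fields`, `result`) are only for illustration. The sketch uses `grep -w` so that a name such as snp1 cannot also select snp10.

```shell
# Shortened sample inputs from the question, written to a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10' \
              'sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB' \
              'sampl2,AA,BB,BB,BB,AB,AA,AB,BB,BB,BB' > file
printf '%s\n' snp3 snp6 snp7 snp10 > columns

# 1. Take the header line only, one column name per line,
#    each prefixed with its position and a tab.
numbered=$(sed 1q file | tr ',' '\n' | nl -w1)

# 2. Keep the numbered lines whose name is listed in `columns`
#    (-F: fixed strings, -w: whole words only).
matched=$(printf '%s\n' "$numbered" | grep -wFf columns)

# 3. Keep only the positions and join them with commas.
fields=$(printf '%s\n' "$matched" | cut -f1 | paste -sd, -)

# 4. Hand the positions to cut, always keeping field 1 (id).
result=$(cut -d, -f"1,$fields" file)
printf '%s\n' "$result"
```

Since only the header is scanned to find the positions, the large file is read in full just once, by the final `cut`.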
or awk to the rescue!
$ awk 'NR==FNR {cols[$1]; next}
FNR==1 {for(i=2;i<=NF;i++) if($i in cols) colin[i]}
{line=$1;
for(i=1;i<=NF;i++) if(i in colin) line=line FS $i;
print line}' columns FS=, file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
Answer 1 (score: 1)
I recommend csvkit. It was built for this kind of job, and it works correctly even when some of the fields are double-quoted strings.
Install:
sudo apt install python3-csvkit
Usage:
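The -c option takes the column names, and tr replaces each '\n' with ','. Because we do not want the argument to end with a ',', sed is used to remove the trailing one.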
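The command this description refers to does not appear in the answer as it stands. What follows is only a sketch consistent with the description, not the original command. It assumes csvkit's `csvcut` is on the PATH and that the files are named `columns` and `file` as in the other answers. Prepending `id` and writing to `subset.csv` are my own assumptions.

```shell
# Shortened sample inputs from the question, written to a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,snp1,snp2,snp3,snp4,snp5,snp6,snp7,snp8,snp9,snp10' \
              'sampl1,AA,BB,AB,BB,AA,AA,AB,BB,BB,BB' > file
printf '%s\n' snp3 snp6 snp7 snp10 > columns

# Turn the one-name-per-line list into a comma-separated argument:
# tr maps every newline to a comma, sed drops the trailing comma.
cols=$(tr '\n' ',' < columns | sed 's/,$//')
printf 'columns argument: %s\n' "$cols"

# csvcut -c accepts a comma-separated list of column names.
# "id" is prepended here (assumption) so the sample identifier is kept.
# The call is skipped when csvkit is not installed.
if command -v csvcut >/dev/null 2>&1; then
    csvcut -c "id,$cols" file > subset.csv
fi
```

Because csvkit parses the file as real CSV, a quoted field that contains a comma stays intact, which the plain `cut` and `awk -F,` approaches cannot guarantee.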
Answer 2 (score: 0)
$ cat tst.awk
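BEGIN { FS=OFS="," }
NR==FNR {                       # first file: the list of wanted column names
    list["id"]                  # always keep the id column
    list[$0]
    next
}
FNR==1 {                        # header of the data file: record wanted field numbers
    for (i=1; i<=NF; i++) {
        if ($i in list) {
            f[++nf] = i
        }
    }
}
{                               # every line of the data file, header included
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}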
$ awk -f tst.awk list file
id,snp3,snp6,snp7,snp10
sampl1,AB,AA,AB,BB
sampl2,BB,AA,AB,BB
sampl3,AB,AA,AA,BB
sampl4,AA,AA,BB,BB
sampl5,AB,AA,AA,BB
sampl6,AB,AA,AB,BB
sampl7,BB,AA,AB,BB
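One thing to be aware of: a name in `list` that never occurs in the header is simply skipped by the script above, with no message. A minimal sketch of a pre-check using the same two-file idiom follows. The tiny inputs and the name `snp99` are made up for illustration.

```shell
# Tiny made-up inputs in a scratch directory; snp99 is deliberately absent.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'id,snp1,snp2,snp3' 'sampl1,AA,BB,AB' > file
printf '%s\n' snp3 snp99 > list

missing=$(awk -F, '
    NR==FNR { want[$0]; next }                        # first file: remember requested names
    FNR==1  { for (i=1; i<=NF; i++) delete want[$i]   # drop every name found in the header
              for (name in want) print name           # anything left was never seen
              exit }                                  # the header is all we need
' list file)
printf 'not in header: %s\n' "$missing"
```

The check stops after the header line, so it costs next to nothing even on a file with millions of rows.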