Extracting text data using bash utilities

Time: 2015-05-18 12:07:30

Tags: bash text multiple-columns

I need to extract some relevant data from a large CSV log, which is an important task for me. The log looks like this:

Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688
1,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392
1,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336
1,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488
1,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472
...
2,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688
2,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392
2,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336
2,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488
2,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472
...
n,1,119.745,0.356,-132.009,-95.618,1.7886312,-105.7373688
n,2,106.093,-3.835,-182.473,40.582,0.7132608,-38.9197392
n,3,21.228,-1.744,-38.026,-7.707,1.1189664,-25.1300336
n,4,-5.717,-4.721,-30.38,-4.839,0.406512,-45.250488
n,5,70.846,-4.127,-53.317,-2.534,0.7808472,11.6488472

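Here I want to pick one specified value from the 2nd column (Residue) and write out the evolution of its last column (TOTAL energy) as a function of the 1st column (Frame #, i.e. the snapshot number). In other words, I need to 1) sort (that is, filter) all the data by the 2nd column: select every line whose second column equals a specified value (e.g. n = 27)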

#Frame, #Residue

1,27, ... , # last column value that I am interested in!
2,27, ... , # last column value that I am interested in!
3,27, ... , # last column value that I am interested in!
3,27, ... , # last column value that I am interested in!

and 2) extract the corresponding value of the last column, so that the resulting log has only 3 columns:

#Frame, #Residue, # Total energy

1,27, # last column value that I am interested in!
2,27, # last column value that I am interested in!
3,27, # last column value that I am interested in!
3,27, # last column value that I am interested in!

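I would appreciate any implementation using awk and sed!

Thanks!

Gleb

2 answers:

Answer 0: (score: 2)

To extract the lines that have 27 in the second column, you can use grep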

  grep '^[^,]\+,27,' input.csv
        | |   |
beginning |   |
    not comma |
              repeated

To output only columns 1, 2 and 8, use cut

grep '^[^,]\+,27,' input.csv | cut -d, -f1,2,8
                                    |   |
                              delimiter |
                                       fields
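
The comma after 27 in the pattern matters: without it, the expression also matches residues such as 270. The following is only a sketch on a made-up sample file (the temporary file and all numbers in it are invented for illustration and are not taken from the real log). It uses grep -E with a plain + as a portable equivalent of the \+ above, which is a GNU grep extension.

```shell
# Build a small sample file; the values are invented purely for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,27,1.0,2.0,3.0,4.0,5.0,-10.5
1,270,1.0,2.0,3.0,4.0,5.0,-99.9
2,27,1.0,2.0,3.0,4.0,5.0,-11.5
EOF

# No comma after 27: residue 270 slips through as well.
loose=$(grep -E '^[^,]+,27' "$sample" | cut -d, -f1,2,8)

# Comma after 27: only residue 27 is kept.
strict=$(grep -E '^[^,]+,27,' "$sample" | cut -d, -f1,2,8)

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$sample"
```

The header line is skipped in both cases because its second field is the word Residue, not 27.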

To sort the file by the second column, you can use sort

sort -t, -nk2,2 input.csv
      |   | |
delimiter | |
    numeric |
    sort    by only the second field
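
If the rows for each residue should also come out in frame order, a second numeric key can be added. The sketch below is one possible way to do that while setting the header line aside explicitly. The sample file is a shortened three-column stand-in for the real log (an assumption made to keep the example small), and with sort -n a non-numeric key such as the word Residue is compared as 0, so handling the header separately is just the more explicit option.

```shell
# Shortened, illustrative sample: frame, residue, total energy.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,TOTAL
10,2,-38.9197392
2,2,-38.9197392
1,5,11.6488472
1,2,-38.9197392
EOF

sorted=$(
  # Keep the header as the first line of the output.
  head -n 1 "$sample"
  # Sort the data rows numerically by residue (field 2), then by frame (field 1).
  tail -n +2 "$sample" | sort -t, -k2,2n -k1,1n
)

printf '%s\n' "$sorted"
rm -f "$sample"
```

Without the second key, rows with the same residue are ordered by a byte-wise comparison of the whole line, which would put frame 10 before frame 2.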

Answer 1: (score: 2)

Here is an awk solution:

awk -v n=27 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF }' input.csv
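  • -v n=27 - first assigns the value 27 to an awk variable n.
  • BEGIN { OFS = FS = "," } - the BEGIN section runs before awk starts parsing any data. Here we set both FS (the field separator) and OFS (the output field separator) to ",", so that input lines are split on commas and output lines are joined with commas.
  • $2 == n { print $1, $2, $NF } - for every record (line) whose second field ($2) equals n, print the first, second and last fields.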
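
As a quick check of the explanation above, here is a sketch that runs the same program on a made-up sample file (the file and its numbers are invented for illustration). The extra NR == 1 condition is an addition, not part of the answer: it also passes the header line through, so the output carries column names.

```shell
# Build a small sample file; the values are invented purely for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,Internal,van der Waals,Electrostatic,Polar Solvation,Non-Polar Solv.,TOTAL
1,27,1.0,2.0,3.0,4.0,5.0,-10.5
1,270,1.0,2.0,3.0,4.0,5.0,-99.9
2,27,1.0,2.0,3.0,4.0,5.0,-11.5
EOF

# NR == 1 keeps the header; $2 == n keeps only the requested residue.
result=$(awk -v n=27 'BEGIN { OFS = FS = "," }
                      NR == 1 || $2 == n { print $1, $2, $NF }' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because both $2 and n look like numbers, awk compares them numerically, so residue 270 is not selected.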

To stop after m matches:

awk -v n=27 -v m=3 'BEGIN { OFS = FS = "," } $2 == n { print $1, $2, $NF; if (++count == m) exit}' input.csv
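
If this is needed for several residues, the two commands can be folded into one small shell function. This is only a sketch: the name extract_residue and the convention that an omitted limit means "no limit" are assumptions made here, not part of the answer, and the sample data is invented.

```shell
# extract_residue FILE RESIDUE [MAX]   (hypothetical helper)
# Prints frame, residue and last column for RESIDUE; stops after MAX matches
# when MAX is given and greater than zero.
extract_residue() {
  awk -v n="$2" -v m="${3:-0}" '
    BEGIN { OFS = FS = "," }
    $2 == n {
      print $1, $2, $NF
      if (m > 0 && ++count == m) exit   # m = 0 means no limit
    }' "$1"
}

# Illustrative sample with three frames for residue 27.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Frame #,Residue,TOTAL
1,27,-10.5
2,27,-11.5
3,27,-12.5
EOF

all=$(extract_residue "$sample" 27)        # every frame
first_two=$(extract_residue "$sample" 27 2)  # stop after two matches

printf 'all:\n%s\n\nfirst two:\n%s\n' "$all" "$first_two"
rm -f "$sample"
```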