Question

I have some large txt files as input that look like this:

# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      2 CAB         1.3730    1.7370   10.6500 C.3       1 P0G    0.0000
      3 CAC        -0.5820    0.2000   10.5350 C.3       1 P0G    0.0000
      4 OAD        -5.1220    5.7850    8.9220 O.2       1 P0G    0.0000
      5 OAE        -2.7610    6.1960    4.9010 O.3       1 P0G    0.0000
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000

...

Each file contains many lines like these:
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000

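My task is to use a Linux shell script with some combination of AWK and sed to remove every column from these lines except the first five (columns 1-5), which are the ones relevant to me. After processing, the sample file should look like this: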

# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 
      2 CAB         1.3730    1.7370   10.6500 
      3 CAC        -0.5820    0.2000   10.5350 
      4 OAD        -5.1220    5.7850    8.9220 
      5 OAE        -2.7610    6.1960    4.9010 
      6 OAF        -0.8620    0.4430    6.3540 
      7 CAG         0.7160   -2.5530   14.2490 
      8 CAH         0.1300   -3.0010   13.0720

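The problem is that files of this type have several lines (their number may vary) before the part I need to process. So my only idea is to use the following line

@<TRIPOS>ATOM

as a reference and to trim the columns only in the lines that come after it.

I would be grateful for a few examples with a short explanation.

Gleb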

Answer 1

With GNU awk 4.0 or later:

gawk 'flag { split($0, f, " ", d); for(i = 1; i <= 5; ++i) printf("%s%s", d[i - 1], f[i]); print ""; next } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

Most of that exists to keep the formatting intact; if formatting doesn't matter, then

awk 'flag { NF = 5 } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

is a simpler approach that also works with older gawk and with mawk. To make this work with BSD awk,

awk 'flag { NF = 5; $1 = $1 } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

is necessary ($1 = $1 is only there to force the line to be rebuilt). Thanks to @tripleee for commenting on this.
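As a small illustration of what "formatting doesn't matter" means here, the following sketch runs the portable variant on a made-up four-line sample (the temporary file and the variable names are hypothetical, used only for this demonstration). The rebuilt lines come out joined by single spaces, so the original column alignment is lost.

```shell
# Hypothetical sample file, for illustration only
sample=$(mktemp)
cat > "$sample" <<'EOF'
lig.pdb
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      2 CAB         1.3730    1.7370   10.6500 C.3       1 P0G    0.0000
EOF

# Truncate every line after the marker to five fields; $1 = $1 forces a rebuild
trimmed=$(awk 'flag { NF = 5; $1 = $1 } /@<TRIPOS>ATOM/ { flag = 1 } 1' "$sample")
printf '%s\n' "$trimmed"
# lig.pdb
# @<TRIPOS>ATOM
# 1 CAA 2.9880 0.1910 12.9830
# 2 CAB 1.3730 1.7370 10.6500

rm -f "$sample"
```

Note that setting NF = 5 also pads lines that have fewer than five fields (including blank lines) with empty fields, so such lines gain trailing spaces.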

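The NF = 5 variants only change the number of fields, which makes awk rebuild the line from the remaining fields, joined by single spaces. The gawk version does more: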

flag {                              # if we're already processing lines
  split($0, f, " ", d)              # split line into array f, save delimiters
                                    # into array d

  for(i = 1; i <= 5; ++i) {         # print the first five fields separated
    printf("%s%s", d[i - 1], f[i])  # by the saved delimiters
  }
  print ""                          # add newline
  next                              # that is all.
}
                                    # if we're not processing lines yet
/@<TRIPOS>ATOM/ { flag = 1 }        # check if we should, and if so set flag
1                                   # then print line unchanged.
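For awks that lack gawk's four-argument split(), a format-preserving variant can be sketched with match() and substr(), both of which POSIX awk provides. This is only a sketch under assumptions: keep_five is a hypothetical helper name, and columns are assumed to be separated by spaces or tabs. Lines after the marker with fewer than five fields are printed unchanged.

```shell
# Hypothetical helper: keep_five FILE
# Prints FILE, keeping only the first five columns (with their original
# spacing) on every line after the one containing @<TRIPOS>ATOM.
keep_five() {
  awk '
    flag {
      # Match optional leading blanks plus five blank-separated fields
      if (match($0, /^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+/))
        $0 = substr($0, 1, RLENGTH)   # drop everything after the fifth field
      print
      next
    }
    /@<TRIPOS>ATOM/ { flag = 1 }      # start trimming on the next line
    1                                 # print header lines unchanged
  ' "$1"
}
```

It would be called as `keep_five filename > trimmed.txt`. The regular expression spells out the five fields instead of using an interval such as {4}, since not every awk supports intervals.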

Addendum: another way to preserve the formatting is to use sed:

sed '1,/@<TRIPOS>ATOM/ ! { s/\b[[:space:]]/\n/5; s/\n.*//; }' filename

That is:

1,/@<TRIPOS>ATOM/ ! {     # For those lines that are not in the range from
                          # the beginning to the first line containing
                          # @<TRIPOS>ATOM

  s/\b[[:space:]]/\n/5    # place a newline after the fifth column
  s/\n.*//                # then remove the newline and everything after it
}

This should work with GNU sed and BSD sed. Since \b is not part of POSIX basic regular expressions, more esoteric seds may need a slight modification:

sed '1,/@<TRIPOS>ATOM/ ! { s/\([^[:space:]]\)[[:space:]]/\1\n/5; s/\n.*//; }' filename

This works in essentially the same way but uses a different regular expression to identify the end of a column.
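The same idea can also be written as a single substitution that captures the first five columns and discards the rest, which avoids both \b and the temporary newline. This is a sketch, assuming a sed that accepts -E for extended regular expressions (GNU sed and BSD sed both do); first_five_sed is a hypothetical name.

```shell
# Hypothetical helper: first_five_sed FILE
# Outside the range "line 1 .. first line containing @<TRIPOS>ATOM",
# keep leading blanks plus five whitespace-separated columns and drop the
# rest. Lines with fewer than five columns do not match and stay unchanged.
first_five_sed() {
  sed -E '1,/@<TRIPOS>ATOM/!s/^([[:space:]]*([^[:space:]]+[[:space:]]+){4}[^[:space:]]+)([[:space:]].*)?$/\1/' "$1"
}
```

Usage would be `first_five_sed filename`. Like the commands above, it leaves the header and the marker line untouched.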

Answer 2

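The following should work:

sed -n '/@<TRIPOS>ATOM/,$p' filename | tail -n +2 | tr -s " " | cut -d" " -f2-6

It works as follows:

Print only the lines from @<TRIPOS>ATOM onward:
```
sed -n '/@<TRIPOS>ATOM/,$p' filename
```
Omit the first line (it contains @<TRIPOS>ATOM, which you do not want):
```
tail -n +2
```
Squeeze the extra spaces between columns:
```
tr -s " "
```
Cut the columns using a space as the delimiter and grab the fields you need. Each data line starts with a space, so field 1 is empty and the five columns are fields 2-6:
```
cut -d" " -f2-6
```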
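The pipeline above prints only the atom lines. If the header should be kept as in the question's expected output, one possible sketch is to print the header separately and then run the same tr and cut steps on the rest. The function name trim_keep_header is hypothetical, and the sketch assumes every data line begins with at least one space, so that after squeezing, the first cut field is empty and the five columns are fields 2-6.

```shell
# Hypothetical helper: trim_keep_header FILE
trim_keep_header() {
  # Header: print everything up to and including the marker line, then quit
  sed '/@<TRIPOS>ATOM/q' "$1"
  # Body: delete everything through the marker line, squeeze runs of
  # spaces, and keep fields 2-6 (field 1 is the empty leading field)
  sed '1,/@<TRIPOS>ATOM/d' "$1" | tr -s ' ' | cut -d' ' -f2-6
}
```

It would be used as `trim_keep_header filename > trimmed.txt`. As with the original pipeline, the column alignment of the data lines is not preserved.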

Removing columns using shell commands
