Question

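I've been sent a tab-delimited file in which the free-text notes contain unescaped tabs. The data should have 19 columns, and the free text is column 13. So I need to find all rows containing > 18 tabs. In each such row, I need to replace any tab that is > 12 from the start of the line and > 6 from the end of the line. I'd like to replace them with the string '@@' (two @ symbols), since I'll put the tabs back further downstream.

The file is 2405 lines including the header row, and some rows have empty "cells", i.e. tabs adjacent to each other. I don't have access to the source of the document, and the supplier doesn't know how to fix it at the source. The text is UTF-* and contains accented characters etc. (i.e. it is not basic ASCII text).

Is there a simple way to solve this that I can use on a Mac running OS 10.8.5 (or 10.9.x if necessary)?

It might help later readers if answers indicate whether the split (here 12/6 of 19) is hard-coded or entered as variables.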

Answer 1

Try the following awk command:

awk -F '\t' -v expectedColCount=19  -v freeFormColNdx=13 '
  NF > expectedColCount { 
      # Calculate the number of field-embedded tabs that need replacing.
    extraTabs = NF - expectedColCount
      # Print 1st column.
    printf "%s", $1
      # Print columns up to and including the first tab-separated token
      # from the offending column.
    for (i = 2; i <= freeFormColNdx; ++i) { printf "%s%s", FS, $i }
      # Print tokens in the offending column separated with "@@" instead of tabs.
    for (i = freeFormColNdx + 1; i <= freeFormColNdx + extraTabs; ++i) { 
      printf "@@%s", $i
    }
      # Print the remaining columns.
    for (i = freeFormColNdx + extraTabs + 1; i <= NF; ++i) { 
      printf "%s%s", FS, $i
    }
      # Print terminating newline.
    printf "\n"
      # Done processing this line; move to next.
    next
  }
  1  # Line has expected # of tabs or fewer (header line)? Print as is.
  ' file
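
As a quick sanity check after running the command, you could list any lines whose field count still differs from the expected 19. This is only a sketch and not part of the command above: the helper name `check_cols` and the file name `cleaned.tsv` are hypothetical, and it assumes the cleaned output was redirected to a file.

```shell
# check_cols FILE COUNT (hypothetical helper name): report every line whose
# tab-separated field count differs from COUNT.
check_cols() {
  awk -F '\t' -v expectedColCount="$2" '
      # Adjacent tabs still count as (empty) fields, so empty cells are fine.
    NF != expectedColCount { printf "line %d: %d fields\n", NR, NF; bad = 1 }
      # Exit status is 1 if any line was reported, 0 otherwise.
    END { exit bad + 0 }
  ' "$1"
}

# Usage (cleaned.tsv is a placeholder for the output of the cleanup command):
#   check_cols cleaned.tsv 19
```

Like the command above, this takes the expected column count as a parameter rather than hard-coding it. No output and an exit status of 0 means every line has exactly that many fields.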

Cleaning up tabs in the free-text column of a tab-delimited file
