Question

I have thousands of tab-separated data files, each of which looks like:

a0\ta1\ta2\ta3\ta4\ta5\ta6\ta7\ta8\ta9\n
b0\tb1\tb2\tb3\tb4\tb5\tb6\tb7\tb8\tb9\n
...

However, occasionally there are files that contain (randomly) malformed lines, such as:

a0\ta1\ta2\ta3_0\n
a3_1\ta4\ta5\ta6\ta7\ta8\ta9\n
b0\tb1\tb2_0\n
b2_1\tb3\tb4\tb5\tb6\tb7\tb8\tb9\n
...

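where a3_0 and a3_1 (b2_0 and b2_1, resp.) are parts of a3 (b2, resp.) that were originally separated by a white space. I would like to replace the \n at the end of a line with a space, but only if the line is too short, i.e. has too few \t. For now, 5 seems to be a safe threshold.

I often use sed for modifications that are much simpler than the one above. I wonder whether sed or some other command (like awk, which I still need to learn) can be used to do this quickly (since I have lots of files). Thanks.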

Answer 1

Using GNU awk for multi-char RS and RT (and later -i inplace and ENDFILE), and using commas instead of tabs for visibility:

$ cat file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0
a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0
b2_1,b3,b4,b5,b6,b7,b8,b9

$ awk -v RS='([^,]*,){9}[^\n]*\n' '{$0=RT; sub(/\n$/,""); gsub(/\n/," ")} 1' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9

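The above (ab)uses RS to describe each record (rather than the record separator) as a series of 10 comma-separated fields, then replaces newlines within each record as needed before printing.

Just change RS='([^,]*,){9}[^\n]*\n' to RS='([^\t]*\t){9}[^\n]*\n' to work with tab-separated fields instead of comma-separated ones.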

To make the change to all files, add -i inplace:

awk -i inplace -v RS='...' '...' *

or:

find ... -exec awk -i inplace -v RS='...' '...' {} +

You don't actually even need to hard-code RS; the tool can figure it out, assuming there is at least one complete line in each input file:

$ awk -F',' '
    BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
    NR==FNR { n=(NF>n?NF:n); next }
    ENDFILE { RS="([^"FS"]*"FS"){"n-1"}[^\n]*\n" }
    { $0=RT; sub(/\n$/,""); gsub(/\n/," "); print }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9

Just change -F',' to -F'\t' for tab-separated fields.

With POSIX awks, the closest equivalents of the two gawk scripts above are:

$ awk '
    { rec=rec $0 RS }
    END{
        while ( match(rec,/([^,]*,){9}[^\n]*\n/) ) {
            tgt = substr(rec,RSTART,RLENGTH)
            sub(/\n$/,"",tgt)
            gsub(/\n/," ",tgt)
            print tgt
            rec = substr(rec,RSTART+RLENGTH)
        }
    }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9

and

awk -F',' '
    { rec=rec $0 RS; n=(NF>n?NF:n) }
    END{
        while ( match(rec,"([^"FS"]*"FS"){"n-1"}[^\n]*\n") ) {
            tgt = substr(rec,RSTART,RLENGTH)
            sub(/\n$/,"",tgt)
            gsub(/\n/," ",tgt)
            print tgt
            rec = substr(rec,RSTART+RLENGTH)
        }
    }
' file
a0,a1,a2,a3,a4,a5,a6,a7,a8,a9
b0,b1,b2,b3,b4,b5,b6,b7,b8,b9
a0,a1,a2,a3_0 a3_1,a4,a5,a6,a7,a8,a9
b0,b1,b2_0 b2_1,b3,b4,b5,b6,b7,b8,b9

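Note that those read the whole file into one string before the main processing starts, so they would fail if a file were too large to fit in memory, but you have told us each file is "very small", so that should not be a problem.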
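If memory ever did become a concern, a streaming variant is possible. The following is only a sketch, not one of the scripts above: it buffers just the current incomplete record and emits it once it holds enough tab-separated fields. The function name join_until_complete and its first argument (the expected field count) are made up for illustration, and it assumes pieces of a broken line should always be joined with a single space.

```shell
#!/bin/sh
# Hypothetical helper: join_until_complete FIELD_COUNT [FILE...]
# Joins consecutive lines with a space until the buffered record
# has at least FIELD_COUNT tab-separated fields, then prints it.
join_until_complete() {
    want=$1
    shift
    awk -F'\t' -v want="$want" '
        {
            # append this line to the buffer, space-separated
            buf = (have ? buf " " $0 : $0)
            have = 1
            # enough fields collected: emit the record and start over
            if (split(buf, parts, FS) >= want) {
                print buf
                buf = ""
                have = 0
            }
        }
        # flush a trailing partial record, if any
        END { if (have) print buf }
    ' "$@"
}
```

For the 10-column files in the question it would be called as join_until_complete 10 file > tmp && mv tmp file.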

To overwrite the input file, the simplest approach is always:

awk '{...}' file > tmp && mv tmp file

but in this case you could also do:

awk '{...} END{... print tgt > ARGV[1] ...}' file

That works here because awk has finished reading the input file before the END section starts. Don't try it anywhere else in a script.

Answer 2

Suppose you save the following script as repiece:

#!/usr/bin/env bash

IFS=$'\t'       # use tab separators throughout this script
rIFS=,          # except to avoid field coalescing, use commas
pieces_needed=5 # adjust this to taste

for arg; do
  tempfile="${arg}.tmp-$$" # vulnerable to symlink attacks; use mktemp instead if untrusted
                           # users have write access to current directory.
  deferred=( )
  {
    while IFS="$rIFS" read -r -a pieces; do
      if (( ( ${#deferred[@]} + ${#pieces[@]} ) < pieces_needed )); then
        deferred+=( "${pieces[@]}" )
      elif (( ${#deferred[@]} )); then
        # separate last piece of deferred and first of pieces with a space
        all_pieces=( "${deferred[@]} ${pieces[@]}" )
        printf '%s\n' "${all_pieces[*]}"
        deferred=( )
      else
        printf '%s\n' "${pieces[*]}"
      fi
    done
    # if we have anything deferred for the last line, print it now
    (( ${#deferred[@]} )) && printf '%s\n' "${deferred[*]}"
  } < <(tr -- "$IFS" "$rIFS" <"$arg") >"$tempfile"
  mv -- "$tempfile" "$arg"
done

...you can then process all your files with as few invocations as possible, like so:

# if your files end in .tsv
find . -type f -name '*.tsv' -exec ./repiece {} +
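The comment in the script recommends mktemp when untrusted users can write to the directory. A minimal sketch of that idea follows. rewrite_in_place is a hypothetical helper, not part of the script above, and it assumes a mktemp that accepts a template argument (GNU coreutils and the BSDs do).

```shell
#!/bin/sh
# Hypothetical helper: rewrite_in_place FILE COMMAND [ARGS...]
# Runs COMMAND with FILE on stdin, writes stdout to a temp file created
# by mktemp next to FILE, and replaces FILE only if COMMAND succeeded.
rewrite_in_place() {
    target=$1
    shift
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" <"$target" >"$tmp"; then
        mv -- "$tmp" "$target"
    else
        rm -f -- "$tmp"     # leave the original untouched on failure
        return 1
    fi
}
```

Keep in mind that mktemp creates the file with restrictive permissions, so the rewritten file may end up with a different mode than the original.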

Answer 3

In awk, switching ORS between a space and \n:

$ awk '
BEGIN { 
    FS=OFS="\t"       # set field separators
    RS=ORS="\n"       # set record separators
}
NF<=5 {               # if below or at threshold
    ORS=" "           # redefine output record separator
}
{
    print             # print record with ORS
    ORS="\n"          # reset ORS back to newline
}' file
a0      a1      a2      a3      a4      a5      a6      a7      a8      a9
b0      b1      b2      b3      b4      b5      b6      b7      b8      b9
a0      a1      a2      a3_0 a3_1       a4      a5      a6      a7      a8      a9
b0      b1      b2_0 b2_1       b3      b4      b5      b6      b7      b8      b9

Processing multiple files with a shell loop:

$ for f in file1 file2 ; do awk ... $f > new-$f ; done

Quote $f if needed (for example, when file names contain spaces).

Answer 4

This might work for you (GNU sed):

sed ':a;s/\t/&/9;t;N;s/\n/ /;ta' file

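If the current line contains fewer than 9 tabs, append the next line and replace the newline with a space. Repeat until there are 9 or more tabs.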
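To make the control flow easier to follow, the same commands can be spelled out one per -e. This is only an annotated sketch of the one-liner: the function name join_short_lines is made up for illustration, and like the original it relies on GNU sed understanding \t.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner, one sed command per -e.
join_short_lines() {
    # :a        label marking the start of the loop
    # s/\t/&/9  replace the 9th tab with itself; succeeds only if
    #           the pattern space holds at least 9 tabs
    # t         if that succeeded, branch to the end and print the line
    # N         otherwise append the next input line (adds a newline)
    # s/\n/ /   turn that newline into a space
    # ta        the substitution succeeded, so loop back to :a
    sed -e ':a' \
        -e 's/\t/&/9' \
        -e 't' \
        -e 'N' \
        -e 's/\n/ /' \
        -e 'ta' "$@"
}
```

It takes the same arguments as sed itself, so join_short_lines file prints the repaired lines to standard output.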

Replace '\n' with a space only if the number of '\t' occurrences is less than a given number
