Question

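I want to split a large file (185 million records) into multiple files based on the value of one column. The file is a .dat file, and its columns are delimited by ^A (\u0001).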

The file contents look like this:

194^A1^A091502^APR^AKIMBERLY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A1^A091502^APR^AJOHN^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A^A091502^APR^AASHLEY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A3^A091502^APR^APETER^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A4^A091502^APR^AJOE^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A

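Now I want to split the file based on the value of the second column. As you can see, the second column is empty in the third line. All lines with an empty second column should go into one file, and all remaining lines into another.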

Please help me with this. I tried googling, and it seems awk is the tool to use.

Regards, Shankar

Answer 1

Using awk:

awk -F '\x01' '$2 == "" { print > "empty.dat"; next } { print > "normal.dat" }' filename

The file names can of course be chosen freely. print > "file" prints the current record to a file named "file".
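To try the split on something smaller than 185 million records, a quick test along these lines may help. This is only a sketch: sample.dat and its three records are made up for illustration, and the separator is written as the octal escape \001, which denotes the same byte as \x01.

```shell
# sample.dat is a hypothetical test file; "|" is written first and then
# converted to the \001 (^A) byte so the delimiter is easy to see here.
printf '194|1|091502|PR|KIMBERLY\n194||091502|PR|ASHLEY\n194|3|091502|PR|PETER\n' |
  tr '|' '\001' > sample.dat

# Same split as above, with the separator written as the octal escape \001.
awk -F '\001' '$2 == "" { print > "empty.dat"; next } { print > "normal.dat" }' sample.dat

# Render ^A as "|" again to inspect the two output files.
tr '\001' '|' < empty.dat    # prints: 194||091502|PR|ASHLEY
tr '\001' '|' < normal.dat   # prints the KIMBERLY and PETER lines
```

Only the record with the empty second field should end up in empty.dat; the other two go to normal.dat.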

Addendum, re the comment: removing the column is a bit trickier, but certainly feasible. I'd use

awk -F '\x01' 'BEGIN { OFS = FS } { fname = $2 == "" ? "empty.dat" : "normal.dat"; for(i = 2; i < NF; ++i) $i = $(i + 1); --NF; print > fname }' filename

This works as follows:

BEGIN {                                          # output field separator is
  OFS = FS                                       # the same as input field
                                                 # separator, so that the
                                                 # rebuilt lines are formatted
                                                 # just like they came in
}
{
  fname = $2 == "" ? "empty.dat" : "normal.dat"  # choose file name

  for(i = 2; i < NF; ++i) {                      # set all fields after the
    $i = $(i + 1)                                # second back one position
  }

  --NF                                           # let awk know the last field
                                                 # is not needed in the output

  print > fname                                  # then print to file.
}
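The effect of the column removal can be checked on a tiny input. The following is a sketch under a couple of assumptions: sample2.dat and its two records are made up, and the awk in use is one such as gawk or mawk, where decrementing NF rebuilds the record. The separator is again written as the octal escape \001.

```shell
# sample2.dat is a hypothetical test file; "|" is converted to \001 (^A).
printf '194|1|091502|PR|KIMBERLY\n194||091502|PR|ASHLEY\n' |
  tr '|' '\001' > sample2.dat

# Pick the output file from the second field, then shift the later
# fields left by one and drop the now-duplicated last field.
awk -F '\001' 'BEGIN { OFS = FS } { fname = $2 == "" ? "empty.dat" : "normal.dat"; for(i = 2; i < NF; ++i) $i = $(i + 1); --NF; print > fname }' sample2.dat

# Render ^A as "|" to inspect the results.
tr '\001' '|' < normal.dat   # prints: 194|091502|PR|KIMBERLY
tr '\001' '|' < empty.dat    # prints: 194|091502|PR|ASHLEY
```

Both output lines have four fields instead of five, with the original second column gone and the ^A delimiter preserved between the rest.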
