Question

Let's keep n = 3 here, and say I have two files:

file1.txt

a b c row1
d e f row2
g h i row3
j k l row4
m n o row5
o q r row6
s t u row7
v w x row8
y z Z row9

file2.txt

1 2 3
4 5 6
7 8 9

I want to merge these two files into new_file.txt:

new_file.txt

a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9

Currently I do it like this (there are of course also slow bash for or while loop solutions):

awk '1;1;1' file2.txt > tmp2.txt

and then, for the case listed in my question:

awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' tmp2.txt file1.txt > new_file.txt

Or, putting these on one line:

awk '1;1;1' file2.txt | awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' - file1.txt > new_file.txt

But neither of these looks elegant...

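I'm looking for a more elegant one-liner (perhaps awk) that does this efficiently.

In the real case, suppose the input file1.txt has 9 million lines and the input file2.txt has 3 million lines. I want to append columns 2 and 3 of the first line of file2.txt as the new last columns of the first 3 lines of file1.txt, columns 2 and 3 of the second line of file2.txt as the new last columns of the next 3 lines of file1.txt, and so on.

Thanks!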

Answer 1

Try this. For details on the <() syntax, see mywiki.wooledge - Process Substitution.

$ # transforming file2
$ cut -d' ' -f2-3 file2.txt | sed 'p;p'
2 3
2 3
2 3
5 6
5 6
5 6
8 9
8 9
8 9

$ # then paste it together with required fields from file1
$ paste -d' ' <(cut -d' ' -f1-3 file1.txt) <(cut -d' ' -f2-3 file2.txt | sed 'p;p')
a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
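
The pipeline above hard-codes n = 3 through sed 'p;p' and needs a shell with process substitution. As a rough sketch only, the same idea can be written with the repeat count in a variable and plain POSIX sh. The variable names n, workdir and result are made up for this example, and it assumes every line of file1.txt has exactly four space-separated fields, as in the question.

```shell
# Sketch: repeat count n as a parameter, POSIX sh only (no process substitution).
# Assumption: file1.txt lines have exactly 4 space-separated fields.
n=3
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Small sample inputs in the same shape as file1.txt and file2.txt.
printf '%s\n' 'a b c row1' 'd e f row2' 'g h i row3' \
              'j k l row4' 'm n o row5' 'o q r row6' > file1.txt
printf '%s\n' '1 2 3' '4 5 6' > file2.txt

# 1. awk prints columns 2-3 of every file2.txt line n times.
# 2. paste glues that stream (read from stdin via "-") onto each file1.txt line.
# 3. cut drops field 4, the old last column of file1.txt.
result=$(awk -v n="$n" '{ for (i = 0; i < n; i++) print $2, $3 }' file2.txt |
    paste -d' ' file1.txt - |
    cut -d' ' -f1-3,5-)

printf '%s\n' "$result"
```

Whether this is any faster or slower than the process-substitution version was not measured here; the point is only that n no longer has to be encoded as a number of p commands.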

Speed comparison; the times shown are for two consecutive runs

$ perl -0777 -ne 'print $_ x 1000000' file1.txt > f1
$ perl -0777 -ne 'print $_ x 1000000' file2.txt > f2
$ du -h f1 f2
95M f1
18M f2


$ time paste -d' ' <(cut -d' ' -f1-3 f1) <(cut -d' ' -f2-3 f2 | sed 'p;p') > t1

real    0m1.362s
real    0m1.154s

$ time awk '1;1;1' f2 | awk 'FNR==NR{a[FNR]=$2" "$3;next};{$NF=a[FNR]};1' - f1 > t2

real    0m12.088s
real    0m13.028s

$ time awk '{ 
         if (c==3) c=0; 
         printf "%s %s %s ",$1,$2,$3; 
         if (!c++){ getline < "f2"; f4=$2; f5=$3 } 
         printf "%s %s\n",f4,f5 
     }' f1 > t3

real    0m13.629s
real    0m13.380s

$ time awk '{ 
         if (c==3) c=0; 
         main_fields=$1 OFS $2 OFS $3; 
         if (!c++){ getline < "f2"; f4=$2; f5=$3 } 
         printf "%s %s %s\n", main_fields, f4, f5 
     }' f1 > t4

real    0m13.265s 
real    0m13.896s

$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical
$ diff -s t1 t4
Files t1 and t4 are identical

Answer 2

Awk solution:

awk '{ 
         if (c==3) c=0; 
         main_fields=$1 OFS $2 OFS $3; 
         if (!c++){ getline < "file2.txt"; f4=$2; f5=$3 } 
         printf "%s %s %s\n", main_fields, f4, f5 
     }' file1.txt

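c - variable reflecting the nth factor
getline < file - reads the next record from file
f4=$2; f5=$3 - contain the values of the 2nd and 3rd fields of the currently read record of file2.txt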

Output:

a b c 2 3
d e f 2 3
g h i 2 3
j k l 5 6
m n o 5 6
o q r 5 6
s t u 8 9
v w x 8 9
y z Z 8 9
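
The getline call in the script above does not check its return value, so if file2.txt runs out early the last values read are silently reused. A possible variation, offered as a sketch rather than as the answer's own method, reads the side file into a variable, checks the result, and takes n as a parameter. The names n, side, line, extra, workdir and result are made up for this example, and stopping with an error on a short file2.txt is just one possible policy.

```shell
# Sketch: getline with a return-value check and a configurable n.
n=3
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Small sample inputs in the same shape as file1.txt and file2.txt.
printf '%s\n' 'a b c row1' 'd e f row2' 'g h i row3' \
              'j k l row4' 'm n o row5' 'o q r row6' > file1.txt
printf '%s\n' '1 2 3' '4 5 6' > file2.txt

result=$(awk -v n="$n" -v side="file2.txt" '
    # On records 1, n+1, 2n+1, ... fetch the next line of the side file.
    (FNR - 1) % n == 0 {
        if ((getline line < side) > 0) {
            split(line, f, " ")          # getline var leaves $0 and NF untouched
            extra = f[2] OFS f[3]
        } else {
            # Side file exhausted or unreadable: stop (policy chosen for this sketch).
            print "no more lines in " side " at record " FNR > "/dev/stderr"
            exit 1
        }
    }
    # Every record: first three fields plus the current pair from the side file.
    { print $1, $2, $3, extra }
' file1.txt)

printf '%s\n' "$result"
```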

Answer 3

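This is still considerably slower than Sundeep's cut-and-paste code on a 100,000-line test (8s vs 21s on my laptop), but it is perhaps easier to understand than the other Awk solution. (I did have to play around a bit to get the indexing right, though.)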

awk 'NR==FNR { a[FNR] = $2 " " $3; next }
    { print $1, $2, $3, a[1+int((FNR-1)/3)] }' file2.txt file1.txt

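This simply keeps (the relevant parts of) file2.txt in memory, then reads file1.txt and writes out the combined lines. That also means it is limited by available memory, whereas Roman's solution will scale to basically arbitrarily large files (as long as each line fits in memory!). On the other hand, this version is slightly faster: I get 28s real time for Roman's script with Sundeep's 100k test data.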
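
The index expression 1+int((FNR-1)/3) is the part that needed some experimenting, so here is a small illustration of what it computes. It is only a sketch: the variable names n, row and result are made up, and the loop simply feeds row numbers 1 through 9 into the same formula.

```shell
# Sketch: which file2.txt line each file1.txt row number maps to, for n = 3.
result=$(awk -v n=3 'BEGIN {
    for (row = 1; row <= 9; row++)
        # Rows 1..n map to 1, rows n+1..2n map to 2, and so on.
        printf "%d%s", 1 + int((row - 1) / n), (row < 9 ? " " : "\n")
}')

printf '%s\n' "$result"
# prints: 1 1 1 2 2 2 3 3 3
```

Subtracting 1 before dividing is what keeps row n in the first group; int(row / n) alone would already move row 3 to group 2.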

awk: insert the lines of one file as new columns into every nth row of another file
