How to split a text file by the line counts of another set of files?

Asked: 2014-12-18 18:57:09

Tags: bash unix awk sed wc

I want to cut a file into several files according to a list of line counts:

$ wc -l all.txt
    8500   all.txt

$ wc -l STS.*.txt  
   2000 STS.input.answers-forums.txt
   1500 STS.input.answers-students.txt
   2000 STS.input.belief.txt
   1500 STS.input.headlines.txt
   1500 STS.input.images.txt

How can I split all.txt according to the number of lines in each STS.*.txt file, and save each piece to the corresponding STS.output.*.txt?

I have been doing this manually, like so:

$ sed '1,2000!d' all.txt > STS.output.answers-forums.txt
$ sed '2001,3500!d' all.txt > STS.output.answers-students.txt
$ sed '3501,5500!d' all.txt > STS.output.belief.txt
$ sed '5501,7000!d' all.txt > STS.output.headlines.txt
$ sed '7001,8500!d' all.txt > STS.output.images.txt

The all.txt input looks like this:

$ head all.txt
2.3059
2.2371
2.1277
2.1261
2.0576
2.0141
2.0206
2.0397
1.9467
1.8518

Sometimes all.txt looks like this:

$ head all.txt
2.3059  92.123
2.2371  1.123
2.1277  0.12452
2.1261123   213
2.0576  100
2.0141  0
2.02062 1
2.03972 34.123
1.9467  9.23
1.8518  9123.1

As for the STS.*.txt files, they are just lines of plain text, for example:

$ head STS.output.answers-forums.txt
The problem likely will mean corrective changes before the shuttle fleet starts flying again.   He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.
The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650.   The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.
"It's a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896.   "It's a huge black eye," Arthur Sulzberger, the newspaper's publisher, said of the scandal.

2 answers:

Answer 0: (score: 1)

I would suggest writing a loop:

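total=0  # lines of all.txt consumed so far
for file in answers-forums answers-students belief headlines images; do
    # number of lines in the matching input file
    lines=$(wc -l < "STS.input.$file.txt")
    # keep only the next $lines lines of all.txt
    sed "$(( total + 1 )),$(( total + lines ))!d" all.txt > "STS.output.$file.txt"
    (( total += lines ))
done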

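total keeps track of how many lines have been read so far. The sed command extracts lines total + 1 through total + lines and writes them to the corresponding output file.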
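
To check the loop on something smaller than 8500 lines, here is a minimal sketch that runs the same idea on a 10-line all.txt split into pieces of 2, 3, and 5 lines. The names first, second, and third and the temporary directory are made up for this illustration. It also swaps in sed -n 'M,Np', which selects the same range as 'M,N!d' and should avoid bash history expansion of ! if the lines are pasted into an interactive shell.

```shell
#!/usr/bin/env bash
# Toy data: every file name below is hypothetical, created only for this check.
demo=$(mktemp -d)
cd "$demo" || exit 1

seq 1 10 > all.txt                                  # 10 lines: 1..10
printf 'a\nb\n'          > STS.input.first.txt      # 2 lines
printf 'c\nd\ne\n'       > STS.input.second.txt     # 3 lines
printf 'f\ng\nh\ni\nj\n' > STS.input.third.txt      # 5 lines

total=0   # lines of all.txt already written out
for file in first second third; do
    lines=$(wc -l < "STS.input.$file.txt")
    # -n suppresses default output; p prints only the requested range
    sed -n "$(( total + 1 )),$(( total + lines ))p" all.txt > "STS.output.$file.txt"
    total=$(( total + lines ))
done

wc -l STS.output.*.txt
```

After this runs, the three output files should hold lines 1-2, 3-5, and 6-10 of all.txt respectively.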

Answer 1: (score: 1)

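I wish you had posted sample input for splitting an input file of, say, 10 lines into output files of, say, 2, 3, and 5 lines rather than 8500 lines, as that would have given us something to test a solution against. Oh well, this might work, but of course it is untested: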

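awk '
# For every file except the last (all.txt): remember which output file each overall line number belongs to
ARGIND < (ARGC-1) { outfile[NR] = gensub(/input/,"output",1,FILENAME); next }
# In all.txt: FNR is its own line number, so look up the destination file
{ print > outfile[FNR] }
' STS.input.* all.txt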

The above uses GNU awk for ARGIND and gensub().

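It simply creates an array that maps each line number across all of the "input" files to the name of the "output" file to which the line with the same number in "all.txt" should be written.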
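
For anyone without GNU awk, the same line-number-to-file mapping can be sketched without ARGIND or gensub(). This is an untested-on-real-data variant, not the script above: it assumes the last file is literally named all.txt, replaces gensub() with sub() on a copy of FILENAME, and runs on made-up 2-, 3-, and 5-line input files in a temporary directory.

```shell
#!/usr/bin/env bash
# Toy data: every file name below is hypothetical, created only for this check.
demo2=$(mktemp -d)
cd "$demo2" || exit 1

seq 1 10 > all.txt                                  # 10 lines: 1..10
printf 'a\nb\n'          > STS.input.first.txt      # 2 lines
printf 'c\nd\ne\n'       > STS.input.second.txt     # 3 lines
printf 'f\ng\nh\ni\nj\n' > STS.input.third.txt      # 5 lines

awk '
# While reading the input files, NR counts lines across all of them,
# so outfile[NR] names the output file for that line number of all.txt.
FILENAME != "all.txt" {
    out = FILENAME
    sub(/input/, "output", out)    # STS.input.X.txt becomes STS.output.X.txt
    outfile[NR] = out
    next
}
# Now reading all.txt: FNR is the line number within all.txt.
# Lines beyond the mapped range are skipped instead of causing an error.
FNR in outfile { print > (outfile[FNR]) }
' STS.input.first.txt STS.input.second.txt STS.input.third.txt all.txt

wc -l STS.output.*.txt
```

The parentheses around outfile[FNR] keep the redirection target unambiguous in awks other than gawk, and all.txt is read only once no matter how many pieces it is split into.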

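Any time you write a loop in shell just to manipulate text, you have the wrong approach. The people who created the shell also created awk for the shell to call to manipulate text, so just do that.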