Bash script to break a txt file onto new lines at a specific number?

Time: 2016-02-24 12:24:34

Tags: bash

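I have a huge .txt file (almost 8,000,000 characters). It contains all the applications, each of which carries the same ID number.

Each time this specific application number occurs, I want a break to a new line in the txt file.

What is the smartest way to solve this? A bash script that runs on Windows would be preferable, but what is a good approach for a file this large?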

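Example input:

12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122

I want the script to split the text onto a new line each time 12345 occurs, keeping everything up to the next occurrence of 12345 on that line, if that makes sense!

However, some inputs may have no spaces in between, so it could be 12345123555123453413412345AAAAAA. How should that be handled?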

1 answer:

Answer 0 (score: 1)

Given an input file containing one very long line like the following:

$ cat filename.txt
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122

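You can use sed -e 's/\s\(12345\)\b/\n\1/g' filename.txt to start a new line at each '12345' that stands alone (preceded by whitespace and not part of a longer word). The preceding whitespace character is replaced by the newline. Note that \s, \b, and \n in the replacement are GNU sed extensions. For example: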

$ sed -e 's/\s\(12345\)\b/\n\1/g' filename.txt
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
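To see which occurrences the expression actually breaks on, here is a minimal sketch on a shortened line. The sample text and the `sample` variable are made up for illustration, and GNU sed is assumed.

```shell
# Hypothetical shortened sample; the real file holds one very long line.
sample=$(mktemp)
printf '12345 123451234512345AAAAAA 12345AAAA NAME 12345 12345AAAA NAME\n' > "$sample"

# Break only where 12345 follows whitespace AND ends at a word boundary.
# " 123451234512345AAAAAA" and " 12345AAAA" are left alone because the
# character after 12345 is a word character, so \b does not match there.
sed -e 's/\s\(12345\)\b/\n\1/g' "$sample"
```

Only the standalone 12345 in the middle triggers a break, so the sample comes out as two lines.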

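You can use sed's -i.bak option to edit the file in place while keeping a backup of the original in filename.txt.bak, or use sed -i alone to skip the backup. For testing, you can use sed -e ... | head -n10 to view the first 10 lines that the sed expression produces.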
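A preview-then-edit workflow could be sketched as follows. The temporary file and its contents are placeholders, and GNU sed is assumed for -i with a backup suffix.

```shell
# Hypothetical small input standing in for filename.txt.
f=$(mktemp)
printf 'A 12345 B 12345 C\n' > "$f"

# 1. Preview: print only the first 10 resulting lines, file untouched.
sed -e 's/\s\(12345\)\b/\n\1/g' "$f" | head -n 10

# 2. Edit in place, keeping the original as "$f.bak".
sed -i.bak -e 's/\s\(12345\)\b/\n\1/g' "$f"
```

Previewing first is cheap even on a large file, because head stops reading after 10 lines.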

If you want a script that takes the filename to search and the token at which to break the line, you can do something like this:

#!/bin/sh

[ -z "$1" ] || [ -z "$2" ] && {  ## validate 2 arguments given
    ## ${0##*/} strips the directory part of $0 and works in POSIX sh
    printf "error: insufficient input, usage: %s file token\n" "${0##*/}" >&2
    exit 1
}

[ -f "$1" ] || {  ## validate the first is a filename
    printf "error: invalid filename '%s' (file not found).\n" "$1"
    exit 1
}

## call the sed command (token is used as a regex; \s, \b, \n need GNU sed)
sed -e "s/\s\(${2}\)\b/\n\1/g" "$1"
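For the follow-up case in the question, where the IDs run together with no spaces, the whitespace and word-boundary conditions never match. One possible sketch, assuming every occurrence of the token should start a new line (an assumption about the intent, not something the question confirms), drops both conditions. The `nospace` and `token` names are illustrative, and GNU sed is assumed.

```shell
# Sample taken from the question: IDs packed together without separators.
nospace=$(mktemp)
printf '12345123555123453413412345AAAAAA\n' > "$nospace"
token=12345

# Insert a newline before EVERY occurrence of the token (& is the match),
# then drop the leading newline created when a line starts with the token.
sed -e "s/${token}/\n&/g" -e 's/^\n//' "$nospace"
```

On the sample this prints 12345123555, 1234534134, and 12345AAAAAA on separate lines. The trade-off is that a 12345 that merely happens to appear inside other data is also treated as a break point.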