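I have a huge .txt file (roughly 8,000,000 characters). It contains all the applications that share the same ID number.
Every time this specific application number occurs, I want to break to a new line in the txt file.
How can I solve this in the smartest way? A bash script that runs on Windows would be preferable, but what is a good approach for a file this large?
Example input: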
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
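- I want the script to start a new line every time 12345 occurs, and keep everything up to the next occurrence of 12345 on that line, if that makes sense!
However, some of the input may have no spaces in between, so it could look like 12345123555123453413412345AAAAAA. How can that be handled?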
Answer 0 (score: 1)
Given an input file that contains one very long line like the following:
$ cat filename.txt
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122 12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
You can use sed -e 's/\s\(12345\)\b/\n\1/g' filename.txt to break the line before each '12345' that stands on its own (preceded by whitespace and not part of a longer word), for example:
$ sed -e 's/\s\(12345\)\b/\n\1/g' filename.txt
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
12345 123451234512345AAAAAA 12345AAAA 08:00:00NAMENAME AA NAME NAME ADRESS 11 1122
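You can use the sed -i.bak option to change the file in place while keeping a backup of the original in filename.txt.bak, or use sed -i to skip the backup. For testing, you can use sed -e ... | head -n10 to view the first 10 lines produced by the sed expression.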
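A minimal sketch of that preview-then-apply workflow, assuming GNU sed (the `\s`, `\b`, and `\n`-in-replacement forms are GNU extensions) and a throwaway file named `sample.txt` that is created here only for illustration:

```shell
# Build a small throwaway sample (hypothetical file name): two records on one line.
printf '%s\n' '12345 AAA 12345AAAA NAME 11 12345 BBB 12345BBBB NAME 22' > sample.txt

# Preview only: write to stdout and look at the first lines; the file is untouched.
sed -e 's/\s\(12345\)\b/\n\1/g' sample.txt | head -n10

# Apply in place, keeping the original content as sample.txt.bak.
sed -i.bak -e 's/\s\(12345\)\b/\n\1/g' sample.txt
```

After the last command, `sample.txt` holds one record per line and `sample.txt.bak` still holds the original single line. Note that `12345AAAA` and `12345BBBB` are left alone, because `\b` requires the token to end at a word boundary.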
If you want a script that takes the filename to search and the token to break the line on, you could do something like this:
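#!/bin/sh
## Break a long line before each standalone occurrence of a token.
## Relies on GNU sed: \s, \b and \n in the replacement are GNU extensions.

if [ -z "$1" ] || [ -z "$2" ]; then    ## validate 2 arguments given
    ## ${0##*/} is the POSIX way to strip the directory part of $0
    printf "error: insufficient input, usage: %s file token\n" "${0##*/}" >&2
    exit 1
fi

[ -f "$1" ] || {    ## validate the first is a filename
    printf "error: invalid filename '%s' (file not found).\n" "$1" >&2
    exit 1
}

## call the sed command; the token is inserted into the regex as-is,
## so any regex metacharacters or '/' in it must be escaped by the caller
sed -e "s/\s\(${2}\)\b/\n\1/g" "$1"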
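The question also mentions input with no spaces between records, which the `\s...\b` pattern above will not split. One possible sketch, assuming GNU sed and assuming that every occurrence of the token really does start a new record (an assumption about the data that the answer above does not confirm), is to break before each occurrence regardless of what surrounds it:

```shell
# Sample string taken from the question: no spaces between occurrences.
input='12345123555123453413412345AAAAAA'

# Insert a newline before every occurrence of the token (& is the matched text),
# then remove the leading newline produced when the data starts with the token.
result=$(printf '%s\n' "$input" | sed -e 's/12345/\n&/g' -e 's/^\n//')
printf '%s\n' "$result"
```

On this sample the result is three lines: `12345123555`, `1234534134`, and `12345AAAAAA`. The trade-off is that any `12345` that happens to appear inside other data (a longer number, for example) is split as well, so this only fits data where the token cannot occur by accident.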