Question

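I have a large text file (over 70 MB) and need to count how many times a character sequence occurs in it. I can find plenty of scripts that do this, but none of them account for the fact that a sequence can start on one line and end on another. For efficiency (I actually have more than one file to process), I cannot preprocess the files to strip the newlines.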

Example: if I am searching for "thisIsTheSequence", the following file contains 3 matches:

asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

Thanks for your help.

Answer 1

One option:

tr -d '\n' < file | grep -o 'thisIsTheSequence' | wc -l

There are probably more efficient ways using utilities outside the core shell tools, especially if you can fit the file in memory.
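
Since the question mentions more than one file, the same idea can be wrapped in a small function. This is only a minimal sketch: count_seq is a hypothetical name, and it assumes a grep that supports -o and that the sequence should be matched as a fixed string (-F) rather than as a regular expression.

```shell
#!/bin/bash
# count_seq PATTERN FILE...
# Hypothetical helper: prints "FILE: COUNT" for each file, counting
# PATTERN as a fixed string with all newlines in the file ignored.
count_seq() {
    local pattern=$1 f n
    shift
    for f in "$@"; do
        # tr joins the lines; grep -o prints each match on its own line,
        # so wc -l gives the number of (non-overlapping) matches
        n=$(tr -d '\n' < "$f" | grep -o -F -e "$pattern" | wc -l | tr -d ' ')
        printf '%s: %s\n' "$f" "$n"
    done
}
```

Calling count_seq thisIsTheSequence file1 file2 would print one line per file. Each file is streamed through the pipeline, so nothing is written back to disk.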

Answer 2

A single awk script is enough, since you are dealing with a huge file. Chaining several pipes may slow things down.

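#!/bin/bash
awk 'BEGIN{
 search="thisIsTheSequence"
 keep=length(search)-1
 total=0
}
{ s=s $0 }   # join lines without the newline
NR%10==0{
  # count matches so far, then keep only the tail that could still begin a match
  total+=gsub(search,"&",s)
  s=substr(s,length(s)-keep+1)
}
END{
 total+=gsub(search,"&",s)
 print "total count: "total
}' file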

Output

$ more file
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasdaasdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

$ ./shell.sh
total count: 9
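
The script above passes the search string to gsub, which treats it as a regular expression. If the sequence may contain characters such as . or *, a literal search is safer. The following is a sketch of that variant, not the answer's original method: count_literal and the PATTERN environment variable are hypothetical names, and it assumes a single input file.

```shell
#!/bin/bash
# count_literal PATTERN FILE
# Hypothetical helper: counts PATTERN as a literal string, ignoring
# newlines, while keeping only a short tail of the text in memory.
count_literal() {
    [ -n "$1" ] || { echo "count_literal: empty pattern" >&2; return 2; }
    PATTERN=$1 awk '
    BEGIN { pat = ENVIRON["PATTERN"]; keep = length(pat) - 1 }
    {
        buf = buf $0
        # index() does a plain substring search, so regex
        # metacharacters in pat need no escaping
        while ((i = index(buf, pat)) > 0) {
            total++
            buf = substr(buf, i + length(pat))
        }
        # keep only the tail that could still begin a match
        if (length(buf) > keep) buf = substr(buf, length(buf) - keep + 1)
    }
    END { print total + 0 }
    ' "$2"
}
```

The pattern is passed through the environment rather than with awk -v, because -v would interpret backslash escapes in the value.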

Answer 3

Will there ever be more than one newline inside your sequence?

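If not, one solution is to split your sequence in half and search for each half (for example, search for "thisIsTh" and "eSequence"), then go back to the occurrences you found and take a closer look, i.e. remove the newlines in that region and check for a match.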

Basically, this is a quick way to "filter" the data down to the interesting parts.
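
A coarse, file-level version of this filter might look like the sketch below. It relies on the assumption stated above (at most one newline inside any occurrence): in that case at least one of the two halves must appear intact on a single line, so a file containing neither half cannot contain the sequence. may_contain is a hypothetical name, and the region-level "closer look" is left out.

```shell
#!/bin/bash
# may_contain PATTERN FILE
# Hypothetical helper: succeeds if FILE contains either half of PATTERN
# on a single line, i.e. if FILE is worth a full, newline-aware count.
may_contain() {
    local pattern=$1 file=$2
    local mid=$(( ${#pattern} / 2 ))
    local first=${pattern:0:$mid}    # e.g. "thisIsTh"
    local second=${pattern:$mid}     # e.g. "eSequence"
    # -F: fixed strings, -q: stop at the first hit
    grep -q -F -e "$first" -e "$second" -- "$file"
}
```

A file for which may_contain thisIsTheSequence file fails can be reported as 0 matches immediately; only the remaining files need the more expensive count.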

Answer 4

Use something like:

head -n LL filename | tail -n YY | tr -d '\n' | grep -o text | wc -l

where LL is the last line of the sequence and YY is the number of lines the sequence spans (i.e. LL - first line + 1).

Linux shell script to count occurrences of a char sequence in a text file?
