I have some sample data below...
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 103461
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 463747
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 434485
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
I'm trying to use a regular expression to find lines that match each other, and then delete those lines so that only one remains for each group. The file size at the end of each line varies slightly; otherwise I could simply remove the duplicate lines.
I've managed to detect the IP (http://regexr.com?2vv5c), but that's as far as I've got. Can anyone help?
UPDATE: In response to the comments, here is the raw data...
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 103461
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 463747
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 434485
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
The following should remain...
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 103461
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 434485
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
Answer 0 (score: 1)
It looks like you can get the desired result using sort -u:
$ sort -k1,6 -u < test.txt
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348
The idea is to define fields 1-6 as the sort key and ask sort to return a unique list after sorting. It then returns only one entry per key (IP + time + file): uniqueness is based on the key definition rather than on the whole line.
Answer 1 (score: 0)
So, to identify the duplicate lines in your sample data, you could do something like this:
String subString = line.substring(0, line.lastIndexOf(" "));
where line is a single line read from your sample data. subString is the String that remains after removing the filesize at the end of each line. You can now easily compare all such subString values and remove the duplicates.
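A minimal Java sketch of that approach, assuming the log lines live in a file named test.txt and that keeping the last line of each duplicate group is acceptable (the DedupeLog class name and the use of a LinkedHashMap are my own choices, not part of the original answer):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class DedupeLog {
    public static void main(String[] args) throws IOException {
        // Key: everything before the trailing file size; value: the most recent line with that key.
        Map<String, String> seen = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Paths.get("test.txt"))) {
            int cut = line.lastIndexOf(" ");
            if (cut < 0) continue;            // skip blank or malformed lines
            String subString = line.substring(0, cut);
            seen.put(subString, line);        // later duplicates overwrite earlier ones
        }
        seen.values().forEach(System.out::println);
    }
}

Because later lines overwrite earlier ones under the same key, this keeps the last line of each duplicate group, which appears to match the "should remain" list in the question; keeping the first instead would be a one-line change (putIfAbsent).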
Answer 2 (score: 0)
Here is a Perl one-liner that does the job:
perl -ane '($id)=$_=~/^(.*) HTTP/;print unless exists $seen{$id};$seen{$id}=1;' logfile.log
It uses the part of each line from the start (^) up to HTTP as the key of the hash %seen, and prints the current line only if that key has not been seen before.
Output:
100.200.300.40 - - [02/Feb/2012:12:18:35 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 132189
100.200.300.40 - - [02/Feb/2012:12:19:10 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 106866
300.230.100.10 - - [02/Feb/2012:12:20:55 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.0" 200 28017662
200.100.600.30 - - [02/Feb/2012:12:27:22 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 464787
200.100.600.30 - - [02/Feb/2012:12:27:54 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 330269
664.387.880.60 - - [02/Feb/2012:12:32:03 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 266372
664.387.880.60 - - [02/Feb/2012:12:32:34 +0000] temp/newfolder/resource/newitem.pdf HTTP/1.1" 200 176348