Question

I have several gigabytes of web application logs from which I need to extract customer data for a client (who did not keep proper backups).

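So far I have cleaned up some of the logs and can see the light at the end of the tunnel. However, I have realized there are a lot of duplicate entries: it seems that every time a customer used the web application, the same data was written to the log. Here is a simplified example: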

initial_date=Jul-26-2015&report_center=0&last_name=bar&first_name=foo&sex=M&birthday=Sep-26-1985&sin=123456789&drivers_license=&address1=414+stackoverflow+Street&residence_type=1&address2=Apartment+103&datemovein=Feb-02-2013&postal=a1a1a1&city=townsville&prov=ontario&country=Canada&telephone=5555555555&cell_phone=5555556666

initial_date=Jan-24-2014&report_center=0&last_name=blah&first_name=steve&sex=M&birthday=aug-11-1983&sin=987654321&drivers_license=&address1=12+stackoverflow+Street&residence_type=1&address2=&datemovein=Jun-02-2011&postal=a9a9a9&city=cityville&prov=ontario&country=Canada&telephone=5551111111&cell_phone=5552222222

initial_date=Jul-26-2015&report_center=0&last_name=bar&first_name=foo&sex=M&birthday=Sep-26-1985&sin=123456789&drivers_license=&address1=414+stackoverflow+Street&residence_type=1&address2=Apartment+103&datemovein=Feb-02-2013&postal=a1a1a1&city=townsville&prov=ontario&country=Canada&telephone=5555555555&cell_phone=5555556666

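I want to match the unique entries and eventually delete the rest. I tried using a regex with positive lookahead to do the job, but from what I have read, that seems to work only when the duplicates are consecutive. Some are, but many are not. Is there a way to achieve this with regex alone?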

Answer 1

There is no good reason to use a regex for this; sort -u will do what you have specified with your example.
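As a rough sketch of what that looks like in practice (the file names sample.log, unique.log and unique_ordered.log are made up for illustration, and the sample lines are shortened versions of the ones in the question): sort -u sorts the whole file and keeps one copy of each distinct line, so it does not matter whether the duplicates are adjacent. The awk line is an alternative I am adding, not part of the suggestion above; it keeps the first occurrence of each line in the original order, at the cost of holding every distinct line in memory.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: lines 1 and 3 are identical but not adjacent.
printf '%s\n' \
  'last_name=bar&first_name=foo&sin=123456789' \
  'last_name=blah&first_name=steve&sin=987654321' \
  'last_name=bar&first_name=foo&sin=123456789' > sample.log

# Sort the lines and keep one copy of each distinct line.
# LC_ALL=C forces plain byte-wise comparison, independent of locale.
LC_ALL=C sort -u sample.log > unique.log

# Alternative (an assumption about what you may want): keep the first
# occurrence of each line and preserve the original order.
awk '!seen[$0]++' sample.log > unique_ordered.log
```

Note that both commands only treat lines as duplicates when they are byte-for-byte identical, so entries that differ in a single field (for example initial_date) will all be kept.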

Finding non-consecutive duplicates in a large text file
