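I have a file with 225,000 lines, many of which are similar. I want to remove all the similar lines and keep only the first line of each "type". An example follows.
I want a file that looks like this: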
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180512_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180513_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180515_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180327.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180328.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180329.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180331.xls
./Archive/20150919-084501.SOMETHING
./Archive/20150922-084501.SOMETHING
./Archive/20150923-084500.SOMETHING
./Archive/20150924-084500.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20170310.20170310-201023.txt.gz
./TEST/TEST.20170313.20170313-011035.txt.gz
./TEST/TEST.20170313.20170313-024006.txt.gz
./TEST/TEST.20170313.20170313-041018.txt.gz
./TEST/TEST.20180402-011024.log.gz
./TEST/TEST.20180402-011200.log.gz
./TEST/TEST.20180402-061113.log.gz
./TEST/TEST.20180402-081013.log.gz
./TEST/TEST.20180402-101012.log.gz
to end up like this:
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./Archive/20150919-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20180402-011024.log.gz
Answer 0 (score: 5)
Find what: ((^.+?)[-_.\d]+(\..+\R))(?:\2[-_.\d]+\3)+
Replace with: $1
Leave ". matches newline" unchecked (the pattern relies on . stopping at line breaks).
Explanation:
(                   # start group 1
  (                 # start group 2
    ^               # beginning of line
    .+?             # 1 or more of any character except newline, non-greedy
  )                 # end group 2
  [-_.\d]+          # 1 or more hyphens, underscores, dots or digits
  (                 # start group 3
    \.              # a literal dot
    .+              # 1 or more of any character except newline
    \R              # any kind of line break
  )                 # end group 3
)                   # end group 1
(?:                 # start non-capturing group
  \2                # backreference to group 2
  [-_.\d]+          # 1 or more hyphens, underscores, dots or digits
  \3                # backreference to group 3
)+                  # end group, must appear 1 or more times
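If you would rather script this than use an editor's replace dialog, here is a minimal Python sketch of the same idea. It is an adaptation, not the exact regex above: Python's `re` module has no `\R`, so `\n` is used instead (assuming `\n` line endings). The function name `keep_first_of_each_type` and the shortened sample are made up for illustration.

```python
import re

# Adaptation of the pattern above for Python's `re` module.
# Assumption: `re` has no \R, so \n stands in for "any line break".
PATTERN = re.compile(
    r"""
    (                   # group 1: the first line of a run
        (^.+?)          # group 2: shortest prefix before the variable part
        [-_.\d]+        # variable part: hyphens, underscores, dots, digits
        (\..+\n)        # group 3: extension plus the line break
    )
    (?:\2[-_.\d]+\3)+   # following lines with the same prefix and extension
    """,
    re.MULTILINE | re.VERBOSE,  # ^ matches at each line start; . stops at \n
)


def keep_first_of_each_type(text: str) -> str:
    """Collapse each run of similar consecutive lines to its first line."""
    if text and not text.endswith("\n"):
        text += "\n"  # group 3 needs a line break after the last line too
    return PATTERN.sub(r"\1", text)


# Shortened version of the example from the question (two lines per type).
sample = """\
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_REPORT_MEMO_APPROVED_20180512_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180327.xls
./Archive/20150919-084501.SOMETHING
./Archive/20150922-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20170310.20170310-201023.txt.gz
./TEST/TEST.20180402-011024.log.gz
./TEST/TEST.20180402-011200.log.gz
"""

print(keep_first_of_each_type(sample), end="")
```

On this sample the sketch prints the first line of each of the five types. Note that, like the original pattern, it only collapses similar lines that are adjacent, and the last line of a run must end with a line break to be matched, which is why the function appends one if it is missing.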
Result for the given example:
./ACT_HERE_REPORT_MEMO_APPROVED_20180510_083000.log.gz
./ACT_HERE_SOMETHING_MEMO_APPROVED_20180326.xls
./Archive/20150919-084501.SOMETHING
./TEST/TEST.20170310.20170310-181017.txt.gz
./TEST/TEST.20180402-011024.log.gz