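I have a text file as input, roughly 5-10 MB in size, that contains many partially repeated strings. I need to find the repeated substrings (with a length between min and max) and store them in a dictionary so that I can replace them with shorter strings.
Example input: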
This is just a sample STRING: at the address of "https://example.com/content/1.jpeg" and another image address in another address maybe here https://example.com/content/3242341.jpeg.
And this sample string can be countinue for ever and you can see that there is no structure for the partial strings...
Expected output:
min=4,max=100
$1:this
$2: sample string
$3: address
$4:https://example.com/content/
$5:.jpeg
$6: another
$7:here
$8:And
$1 is just a$2: at the$3 of "$41$5" and$6 image$3 in$6$3 maybe $7$43242341.$5.
$8$1$2 can be countinue for ever and you can see that t$7is no structure for the partial strings...
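The example is not written very well, but I hope you get the idea. I'd like to know whether something like this is possible, or whether it makes no sense. I can define a special character such as $, or use (__) or any other character that can mark a variable.
(Note: the string can contain almost any UTF-8 character, but I can reserve a few characters.)
Any ideas for an algorithm? Or a regular expression?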
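
To make the goal more concrete, here is a naive Python sketch of the kind of thing I have in mind. It is only an illustration under a few assumptions, not a real solution: the names `count_repeats`, `compress` and `expand` are made up, `$` is assumed to be a reserved character that never occurs in the input, and each reference is written as `$n;` (with a closing `;`) so that a digit following a reference cannot be confused with the index. It counts every substring by brute force, which would be far too slow and memory-hungry for a 5-10 MB file, and it picks the longest repeats first without checking whether the dictionary entry actually pays for itself.

```python
import re
from collections import Counter


def count_repeats(text, min_len, max_len):
    """Return {substring: count} for substrings of length min_len..max_len
    that occur at least twice (overlapping occurrences are counted).
    Brute force: only suitable for small samples."""
    counts = Counter()
    for length in range(min_len, max_len + 1):
        for start in range(len(text) - length + 1):
            counts[text[start:start + length]] += 1
    return {s: c for s, c in counts.items() if c >= 2}


def compress(text, min_len, max_len, marker="$", end=";"):
    """Greedy, longest-first substitution. Returns (dictionary, body)."""
    if marker in text:
        raise ValueError("marker is assumed to be reserved (absent from input)")
    candidates = count_repeats(text, min_len, max_len)
    # Working form: literal strings mixed with int references into `dictionary`.
    segments = [text]
    dictionary = []
    order = sorted(candidates, key=lambda s: (len(s), candidates[s]), reverse=True)
    for s in order:
        # Only count non-overlapping hits that are still in literal text.
        hits = sum(seg.count(s) for seg in segments if isinstance(seg, str))
        if hits < 2:
            continue
        dictionary.append(s)
        index = len(dictionary)  # 1-based, like $1, $2, ...
        new_segments = []
        for seg in segments:
            if not isinstance(seg, str):
                new_segments.append(seg)
                continue
            for i, part in enumerate(seg.split(s)):
                if i:
                    new_segments.append(index)
                if part:
                    new_segments.append(part)
        segments = new_segments
    body = "".join(
        seg if isinstance(seg, str) else f"{marker}{seg}{end}" for seg in segments
    )
    return dictionary, body


def expand(dictionary, body, marker="$", end=";"):
    """Inverse of compress: replace every $n; with dictionary entry n."""
    pattern = re.compile(re.escape(marker) + r"(\d+)" + re.escape(end))
    return pattern.sub(lambda m: dictionary[int(m.group(1)) - 1], body)
```

Calling `compress(text, 4, 100)` on a short sample and then `expand(dictionary, body)` should give back the original text exactly. Unlike my hand-written example above, this sketch is case-sensitive, so `This` and `this` would not share an entry.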