Question

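I have a text file as input, about 5-10 MB in size, that contains many partially repeated strings. I need to find the repeated strings (with a length between min and max) and save them to build a dictionary, so that they can be replaced with shorter strings.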

Example input:

This is just a sample STRING: at the address of "https://example.com/content/1.jpeg" and another image address in another address maybe here https://example.com/content/3242341.jpeg.
And this sample string can be countinue for ever and you can see that there is no structure for the partial strings...

Expected output:

min=4,max=100

$1:this 
$2: sample string
$3: address
$4:https://example.com/content/
$5:.jpeg
$6: another
$7:here 
$8:And 

$1 is just a$2: at the$3 of "$41$5" and$6 image$3 in$6$3 maybe $7$43242341.$5.
$8$1$2 can be countinue for ever and you can see that t$7is no structure for the partial strings...

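The example is not written very well, but I hope you get the idea. I would like to know whether something like this is possible, or whether it makes no sense. I can define a special character such as $, or use (__) or any other character, to mark a variable.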

(Note: the string can contain almost any UTF-8 character, but I can reserve a few characters.)

Any ideas for an algorithm? Or a regular expression?
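To make the idea concrete, here is a rough Python sketch of a greedy dictionary substitution. It is only an illustration under several assumptions, not a definitive answer. The names `compress` and `decompress` are hypothetical. Instead of `$1`, `$2`, ... it uses single characters from the Unicode Private Use Area (U+E000 onward) as the reserved markers, because a textual marker such as `$4` followed by a literal `1` is ambiguous to decode. It tries the longest lengths first and recounts substrings after every replacement, which is far too slow for a 5-10 MB file; at that size a suffix array or similar index would probably be needed to find the repeats.

```python
from collections import Counter

BASE = 0xE000          # start of the Unicode Private Use Area (assumed reserved)
MAX_ENTRIES = 6400     # U+E000..U+F8FF holds 6400 code points


def is_token(ch):
    """True if ch is one of the reserved marker characters."""
    return BASE <= ord(ch) < BASE + MAX_ENTRIES


def compress(text, min_len=4, max_len=100):
    """Greedy sketch: returns (dictionary, encoded_text).

    dictionary[i] is the substring that the marker chr(BASE + i) stands for.
    """
    if any(is_token(ch) for ch in text):
        raise ValueError("input already contains reserved marker characters")
    dictionary = []
    # Try long substrings first so that short ones do not break them up.
    for length in range(min(max_len, len(text) // 2), min_len - 1, -1):
        while len(dictionary) < MAX_ENTRIES:
            counts = Counter(
                text[i:i + length] for i in range(len(text) - length + 1)
            )
            best = None
            for sub, n in counts.most_common():
                if n < 2:
                    break  # remaining candidates occur only once
                if any(is_token(ch) for ch in sub):
                    continue  # keep dictionary entries free of markers
                if text.count(sub) >= 2:  # non-overlapping occurrences
                    best = sub
                    break
            if best is None:
                break  # nothing left to replace at this length
            marker = chr(BASE + len(dictionary))
            dictionary.append(best)
            text = text.replace(best, marker)
    return dictionary, text


def decompress(dictionary, encoded):
    """Expand every marker character back into its dictionary entry."""
    return "".join(
        dictionary[ord(ch) - BASE] if is_token(ch) else ch for ch in encoded
    )
```

Calling `compress(text, min_len=4, max_len=100)` returns the dictionary and the encoded text, and `decompress(dictionary, encoded)` should give back the original input. Whether this actually saves space depends on how the dictionary itself is stored, so the sketch does not try to score candidates by how many bytes they save.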

Finding repeated strings in a large string

0 answers: