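I'm looking for anything in a URL that is an unintelligible sequence of characters and/or digits (hashes, encoded content, GUIDs/UUIDs, ...). I assume all characters are of the type [a-zA-Z0-9_-]. Every positive match will be replaced with the specific string "___ptag___". I found some examples in my dataset: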
# positive matches:
p=["i-IJ0BWanbqz9CojDxJSC3",
"613_18571416_658343_0_12624",
"75ZTZFtTQ15",
"dGhvbWFzLmNvcXVlcnklNDBmcmVlLmZydGh",
"3fd736170ad91c7b59e699fbe7c98e27",
"click8456312324877856p", # even if there is an intelligible word inside, it's a match
"bfjkzahzrfhquchrjghlyuui"]
# negative matches:
n=["sunny_health_fitness_twister_stepper", # words
"Itunesdotypointappledotypointcom", # words
"2019_01_29", # it looks like a date
"Office-365", # it is intelligible (letters followed by small number of digits)
"id217", # too small, represents something
"No_Ad_t0", # too small
"8461531", # too small
"veri876xx_omg"] # represents something
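To compare candidate rules against lists like the two above, a small scoring helper can be useful. This is only a sketch: `score` and `naive_rule` are hypothetical names, the lists are shortened copies of the examples, and the length-based rule is a deliberately naive stand-in rather than a proposal.

```python
def score(predicate, positives, negatives):
    """Return (positives the rule missed, negatives the rule wrongly flagged)."""
    missed = [s for s in positives if not predicate(s)]
    flagged = [s for s in negatives if predicate(s)]
    return missed, flagged

def naive_rule(s):
    # Stand-in rule (an assumption for illustration): flag anything of 11+ characters.
    return len(s) >= 11

# Shortened copies of the example lists from the question.
positives = ["75ZTZFtTQ15", "3fd736170ad91c7b59e699fbe7c98e27"]
negatives = ["2019_01_29", "id217", "sunny_health_fitness_twister_stepper"]

missed, flagged = score(naive_rule, positives, negatives)
print("missed:", missed)
print("wrongly flagged:", flagged)
```

Any function that takes a string and returns a boolean can be passed as `predicate`, so a regex-based rule can be scored the same way.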
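I thought of modelling my rules with a pattern made of 4 OR groups:
anything containing more than 8 consecutive digits, or
anything longer than 9 characters in which digits and letters alternate at least 2 times each, or
anything longer than 9 characters in which lowercase and uppercase letters alternate at least 2 times each, or
anything longer than 9 characters containing 6 or more consecutive consonants: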
import re
[re.findall(r".*\d{9,}.*|(?=.*(?:[0-9]+[A-Za-z]+){2,}).{9,}|(?=.*(?:[a-z]+[A-Z]+){2,}).{9,}|.*(?=[zrtpqsdfghjklmwxcvbnZRTPQDFGHJKLMWXCVBN]{6,}).*.{9,}", i) for i in p+n]
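My regex almost does the job, but it is really not elegant, and I'm afraid it may not be robust and may take a long time to match on a large dataset. Do you have a better idea for achieving this? Thanks.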
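For reference, here is one way the four rules and the replacement step could be written with separately precompiled patterns instead of a single alternation. It is a simplified sketch, not an exact equivalent of the regex above: the thresholds and character classes are copied from it, but the names (`is_opaque`, `mask_url`, `MIN_LEN`) and the example URL are made up, and the length check is applied to the whole token.

```python
import re

PLACEHOLDER = "___ptag___"
MIN_LEN = 9  # minimum token length for the three non-digit rules

# Rule 1: a long run of consecutive digits.
DIGIT_RUN = re.compile(r"\d{9,}")

# Rules 2 to 4, only tried on tokens of at least MIN_LEN characters.
OTHER_RULES = (
    re.compile(r"(?:[0-9]+[A-Za-z]+){2,}"),   # digits/letters alternating twice
    re.compile(r"(?:[a-z]+[A-Z]+){2,}"),      # lowercase/uppercase alternating twice
    re.compile(r"[zrtpqsdfghjklmwxcvbnZRTPQDFGHJKLMWXCVBN]{6,}"),  # consonant run
)

# A token is a maximal run of the characters assumed in the question.
TOKEN = re.compile(r"[A-Za-z0-9_-]+")

def is_opaque(token):
    """Return True if the token looks like an unintelligible sequence."""
    if DIGIT_RUN.search(token):
        return True
    if len(token) < MIN_LEN:
        return False
    return any(rule.search(token) for rule in OTHER_RULES)

def mask_url(url):
    """Replace every opaque token in the URL with the placeholder."""
    return TOKEN.sub(
        lambda m: PLACEHOLDER if is_opaque(m.group()) else m.group(), url
    )

print(mask_url("https://example.com/Office-365/click8456312324877856p?id=id217"))
# prints https://example.com/Office-365/___ptag___?id=id217
```

With these thresholds, "613_18571416_658343_0_12624" from the positive list is still not flagged, because its longest digit run is only 8 digits, so the rules themselves would need adjusting to cover that case.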