Regex: match any unintelligible string in a URL

Time: 2019-01-31 16:15:00

Tags: python regex url

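I am looking for anything in a URL that is an unintelligible sequence of characters and/or digits (hashes, encoded content, GUIDs/UUIDs, ...). I assume all characters are of type [a-zA-Z0-9_-]. Every positive match will be replaced with the specific string "___ptag___". Here are some examples I found in my dataset: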

# Positive matches:
p = ["i-IJ0BWanbqz9CojDxJSC3",
     "613_18571416_658343_0_12624",
     "75ZTZFtTQ15",
     "dGhvbWFzLmNvcXVlcnklNDBmcmVlLmZydGh",
     "3fd736170ad91c7b59e699fbe7c98e27",
     "click8456312324877856p",  # even if there is an intelligible word inside, it's a match
     "bfjkzahzrfhquchrjghlyuui"]

# Negative matches:
n = ["sunny_health_fitness_twister_stepper",  # words
     "Itunesdotypointappledotypointcom",      # words
     "2019_01_29",     # it looks like a date
     "Office-365",     # it is intelligible (letters followed by a small number of digits)
     "id217",          # too small, represents something
     "No_Ad_t0",       # too small
     "8461531",        # too small
     "veri876xx_omg"]  # represents something

I thought of using a pattern made up of 4 OR groups to model my rules:

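  • anything containing more than 8 consecutive digits, or

  • anything longer than 9 characters with 2+ alternations between digits and letters, or

  • anything longer than 9 characters with 2+ alternations between uppercase and lowercase letters, or

  • anything longer than 9 characters with 6+ consecutive consonants.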

import re

# One alternative per rule; findall returns [] for a string that matches none of them.
[re.findall(r".*\d{9,}.*|(?=.*(?:[0-9]+[A-Za-z]+){2,}).{9,}|(?=.*(?:[a-z]+[A-Z]+){2,}).{9,}|.*(?=[zrtpqsdfghjklmwxcvbnZRTPQDFGHJKLMWXCVBN]{6,}).*.{9,}", i) for i in p+n]

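It almost does the job, but it is really inelegant, and I am afraid it may not be robust and may take a long time to match against a large dataset. Do you have a better idea for achieving this? Thanks.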
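
For reference, the four rules and the "___ptag___" replacement can also be written as separate precompiled patterns instead of one long alternation. This is only a minimal sketch under a few assumptions: the names `is_unintelligible`, `mask_url`, `TOKEN`, and `MIN_LEN` are made up for illustration, the consonant class is made case-insensitive (so it also covers uppercase "S", which the single pattern above leaves out), and plain `search` calls plus a length check only approximate the lookahead logic of the original pattern.

```python
import re

# Each rule as its own precompiled pattern (all names here are illustrative).
LONG_DIGIT_RUN = re.compile(r"\d{9,}")                     # more than 8 consecutive digits
DIGIT_LETTER_ALT = re.compile(r"(?:[0-9]+[A-Za-z]+){2,}")  # digits then letters, 2+ times
CASE_ALT = re.compile(r"(?:[a-z]+[A-Z]+){2,}")             # lowercase then uppercase, 2+ times
# Assumption: same consonant set as the original class (no "y"), but
# case-insensitive, so uppercase "S" is included here.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-xz]{6,}", re.IGNORECASE)

TOKEN = re.compile(r"[A-Za-z0-9_-]+")  # the character set assumed in the question
PLACEHOLDER = "___ptag___"
MIN_LEN = 9                            # mirrors .{9,} in the original pattern


def is_unintelligible(token):
    """Return True if any of the four rules fires for this token."""
    if LONG_DIGIT_RUN.search(token):
        return True
    if len(token) < MIN_LEN:
        return False
    return any(rx.search(token) for rx in (DIGIT_LETTER_ALT, CASE_ALT, CONSONANT_RUN))


def mask_url(url):
    """Replace every token judged unintelligible with the placeholder."""
    return TOKEN.sub(
        lambda m: PLACEHOLDER if is_unintelligible(m.group(0)) else m.group(0),
        url,
    )


# Hypothetical URL, used only to show the substitution.
print(mask_url("https://example.com/shop/click8456312324877856p?id=id217"))
# prints: https://example.com/shop/___ptag___?id=id217
```

With this sketch every negative example is left alone and every positive example is flagged except "613_18571416_658343_0_12624", which has no run of 9 digits and no letters, so none of the four rules applies to it. Keeping the rules separate makes it easier to test or tune each one individually.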

0 answers:

No answers