Question

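I have been trying to identify URL formats in the text of a web page. I followed the approach below, but eventually ran into problems.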

-> Using a PHP regex:

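~((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)~i

This identifies almost all types of URL, such as:

example.com
www.example.com
http://example.com
http://www.example.com
https://example.com
https://www.example.com

Unfortunately, it also treats decimal values, prices, phone numbers, and IP addresses as URLs (something I had not considered before). To resolve this, I added a pattern that finds the specific numeric values to exclude. That fixes the URL identifier by excluding numeric values such as: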

Decimal values (1.11)

IP addresses (123.123.123.123)

Prices ($11.11)

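Now a new problem has come up: abbreviations are also treated as URLs.

W.H.O (single letters separated by dots)

So, how can I write a URL-matching PHP regex that excludes the cases mentioned above?

Or

Can I use a PHP regex to identify the single-letter sequences that make up abbreviations, as in the example above?

Thanks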

Answer 1

You can put these exclusions in a negative lookahead and use

$re = '~(?x)\b                   # Word boundary
   (?!                           # Exclusion list
     [A-Z](?:\.[A-Z])+\b         # An uppercase letter and 1+ sequences of . + an uppercase letter
     |                           # or
     \d+(?:\.\d+)+\S+\b          # digits + 1+ dot and digits and 1+ non-whitespaces
   )       
   (?:https?://)?                # Optional http / https protocol part
   (?:[-\w]+\.[-\w.]+)+          # 1+ sequences of 1+ - or word chars, then . and 1+ -, ., or word chars
   \w(?::\d+)?                   # word char and 1 optional sequence of : and 1+ digits
   (?:/(?:[-\w/.]*(?:\?\S+)?)?)* # 0+ sequences of /, 0+ -, word, /, . symbols, then 1 optional sequence of ? and 1+ non-whitespaces
   \b~';                         # word boundary
$str = 'example.com  www.example.com  http://example.com http://www.example.com     https://example.com https://www.example.com  Decimal Values (1.11)  IP Address (123.123.123.123)   W.H.O   Price values ($11.11)';
preg_match_all($re, $str, $matches);
print_r($matches[0]);

See the PHP demo and the regex demo online.
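As a rough sketch of how the pattern above might be packaged for reuse, the snippet below wraps it in a function and also runs the abbreviation alternative on its own, which touches on the second part of the question. The names `extract_urls`, `find_abbreviations`, and `$sample` are hypothetical and chosen only for illustration. Since the pattern has no `i` flag, the abbreviation check only covers uppercase letters.

```php
<?php
// Hypothetical helper: applies the pattern from the answer and returns the matches.
function extract_urls(string $text): array
{
    $re = '~(?x)\b
       (?!                            # exclusion list
         [A-Z](?:\.[A-Z])+\b          # dotted uppercase abbreviation such as W.H.O
         |
         \d+(?:\.\d+)+\S+\b           # dotted numbers: decimals, prices, IP addresses
       )
       (?:https?://)?                 # optional protocol
       (?:[-\w]+\.[-\w.]+)+           # host name with at least one dot
       \w(?::\d+)?                    # final word char, optional port
       (?:/(?:[-\w/.]*(?:\?\S+)?)?)*  # optional path and query string
       \b~';
    preg_match_all($re, $text, $matches);
    return $matches[0];
}

// Hypothetical helper: the abbreviation alternative used as a standalone pattern.
function find_abbreviations(string $text): array
{
    preg_match_all('~\b[A-Z](?:\.[A-Z])+\b~', $text, $matches);
    return $matches[0];
}

$sample = 'Visit example.com or https://www.example.com, pay $11.11, ping 123.123.123.123, ask the W.H.O';

print_r(extract_urls($sample));       // example.com and https://www.example.com
print_r(find_abbreviations($sample)); // W.H.O
```

Keeping the exclusions in a single lookahead means a new false positive can be handled by adding one more alternative there, without touching the URL part of the pattern.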

PHP regex for identifying specific URL patterns