I have been trying to identify URLs in the text of a web page. This is what I tried, and where I ran into problems.
-> Using a PHP regular expression:
~((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)~i
This matches almost every form of URL, for example:
example.com
www.example.com
http://example.com
http://www.example.com
https://example.com
https://www.example.com
Unfortunately, it also treats decimal values, prices, phone numbers, and IP addresses as URLs (cases I had not considered at first). To work around this, I added a check that finds specific numeric patterns and excludes them. That fixed the URL matcher for numeric values such as:
Decimal values (1.11)
IP addresses (123.123.123.123)
Prices ($ 11.11)
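Now a new problem has come up: abbreviations are also treated as URLs, for example:
W.H.O (alphabetic characters)
So, how can I write a PHP regex that identifies URLs while excluding the cases mentioned above?
Or
Can I use a PHP regex to identify single-letter sequences that form abbreviations, as in the example above?
Thanks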
Answer 0: (score: 0)
You can put these exclusions into a negative lookahead and use
$re = '~(?x)\b # Word boundary
(?! # Exclusion list
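[A-Z](?:\.[A-Z])+\b # An uppercase letter, then 1+ sequences of . and an uppercase letter (such as W.H.O)
| # or
\d+(?:\.\d+)+\S+\b # 1+ digits, then 1+ sequences of . and digits, then 1+ non-whitespace chars (such as 1.11 or 123.123.123.123)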
)
(?:https?://)? # Optional http / https protocol part
(?:[-\w]+\.[-\w.]+)+ # 1+ sequences of 1+ - or word chars, then . and 1+ -, ., or word chars
\w(?::\d+)? # word char and 1 optional sequence of : and 1+ digits
(?:/(?:[-\w/.]*(?:\?\S+)?)?)* # 0+ sequences of /, 0+ -, word, /, . symbols, then 1 optional sequence of ? and 1+ non-whitespaces
\b~'; # word boundary
$str = 'example.com www.example.com http://example.com http://www.example.com https://example.com https://www.example.com Decimal Values (1.11) IP Address (123.123.123.123) W.H.O Price values ($11.11)';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
See the PHP demo and the regex demo online.
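If the pattern needs to be reused, one option is to wrap it in a small function. The following is only a sketch: the helper name extract_urls and the sample sentence are assumptions made for illustration and do not come from the original answer. The regex itself is the one shown above, with shortened comments.

```php
<?php
// Hypothetical helper (name chosen for illustration only): returns every
// substring of $text that the URL pattern from the answer accepts.
function extract_urls(string $text): array
{
    $re = '~(?x)\b
        (?!                          # exclusion list (negative lookahead)
            [A-Z](?:\.[A-Z])+\b      # dotted uppercase abbreviations such as W.H.O
          |
            \d+(?:\.\d+)+\S+\b       # dotted numbers such as 1.11 or 123.123.123.123
        )
        (?:https?://)?               # optional protocol
        (?:[-\w]+\.[-\w.]+)+         # host name containing at least one dot
        \w(?::\d+)?                  # last host character, optional port
        (?:/(?:[-\w/.]*(?:\?\S+)?)?)*  # optional path and query string
        \b~';

    // preg_match_all() returns false on failure; fall back to an empty list.
    if (preg_match_all($re, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

// Made-up sample text mixing URLs with the values that should be skipped.
$urls = extract_urls(
    'See www.example.com and https://example.com/path?x=1 ' .
    'but not W.H.O, 1.11, 123.123.123.123 or $11.11'
);
print_r($urls);
```

With this sample text, only the two URLs should be returned. Because the lookahead sits right after the leading \b, it is re-checked at every candidate start position, which is why the tail of an IP address (for example 123.123 at the end) is not picked up as a match either.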