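I have been trying to build a regular expression that does the following:
Find the word "alphabet" inside an XML tag. The search should match all of the following: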
<hw>Al"pha*bet</hw>
<hw>Al"pha*be`t</hw>
<hw>alphabet</hw>
<hw>al*pha*bet</hw>
<hw>al"pha"b"et</hw>
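The word can be broken up by three special characters (", * and `), and the search must be case-insensitive. Could you help me build a regular expression that finds the word "alphabet" whether or not any of the special characters above appear inside it?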
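Answer 0 (score: 2)
This can be worked around, although regular expressions should not really be used to parse XML/HTML and the like.
It is always easier to capture simple samples and then sub-process them in a callback. In this case, capture ([alphabet"*`,]+), strip the unwanted characters, and then compare.
A Perl example follows; the concept is the same in Perl, PHP, C#, etc.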
use strict;
use warnings;

my $sample = '
<hw>Al"pha*bet</hw>
<hw>Al"pha*be`t</hw>
<hw>alphabet</hw>
<hw>al*pha*bet</hw>
<hw>al"pha"b"et</hw>
';
my $specialword    = 'alphabet';
my $uc_specialword = uc($specialword);

# Capture any run of the word's letters plus the separators " * ` , between matching tags
while ($sample =~ m{<([A-Za-z_:][\w:.-]*)(?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)?\s*(?<!/)>([$specialword"*`,]+)</\1\s*>}isg)
{
    my ($matchstr, $checkstr) = ($&, $2);
    $checkstr =~ s/["*`,]//g;    # strip the separator characters
    if (uc($checkstr) eq $uc_specialword) {
        print "Found '$checkstr' in '$matchstr'\n";
    }
}
Expanded regex:
m{ # Regex delim
< # Open tag
([A-Za-z_:][\w:.-]*) # Capture 1, the tag name
(?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)?\s* # optional attr/val pairs
(?<!/)
>
([alphabet"*`,]+) # Capture 2, class of special characters allowed, 'alphabet' plus "*`,
</\1\s*> # Close tag, backref to tag name (group 1)
}xisg # Regex delim. Options: expanded, case insensitive, single line, global
Output:
Found 'Alphabet' in '<hw>Al"pha*bet</hw>'
Found 'Alphabet' in '<hw>Al"pha*be`t</hw>'
Found 'alphabet' in '<hw>alphabet</hw>'
Found 'alphabet' in '<hw>al*pha*bet</hw>'
Found 'alphabet' in '<hw>al"pha"b"et</hw>'
PHP example
Using preg_match()
Available at http://www.ideone.com/8EBpx
<?php
$sample = '
<hw>Al"pha*bet</hw>
<hw>Al"pha*be`t</hw>
<hw>alphabet</hw>
<hw>al*pha*bet</hw>
<hw>al"pha"b"et</hw>
';
$specialword = 'alphabet';
$uc_specialword = strtoupper( $specialword );
$regex = '~<([A-Za-z_:][\w:.-]*)(?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)?\s*(?<!/)>([' . $specialword. '"*`,]+)</\1\s*>~xis';
$pos = 0;
while ( preg_match($regex, $sample, $matches, PREG_OFFSET_CAPTURE, $pos) )
{
    $matchstr = $matches[0][0];
    $checkstr = $matches[2][0];
    // Strip the separator characters before comparing
    $checkstr = preg_replace( '/["*`,]/', "", $checkstr);
    if ( strtoupper( $checkstr ) === $uc_specialword )
        print "Found '$checkstr' in '$matchstr'\n";
    // Resume searching just past the end of this match
    $pos = $matches[0][1] + strlen( $matchstr );
}
?>
Using preg_match_all()
Available at http://www.ideone.com/C6HeT
<?php
$sample = '
<hw>Al"pha*bet</hw>
<hw>Al"pha*be`t</hw>
<hw>alphabet</hw>
<hw>al*pha*bet</hw>
<hw>al"pha"b"et</hw>
';
$specialword = 'alphabet';
$uc_specialword = strtoupper( $specialword );
$regex = '~<([A-Za-z_:][\w:.-]*)(?:\s+(?:".*?"|\'.*?\'|[^>]*?)+)?\s*(?<!/)>([' . $specialword. '"*`,]+)</\1\s*>~xis';
preg_match_all($regex, $sample, $matches, PREG_SET_ORDER);
foreach ($matches as $match)
{
    $matchstr = $match[0];
    $checkstr = $match[2];
    // Strip the separator characters before comparing
    $checkstr = preg_replace( '/["*`,]/', "", $checkstr);
    if ( strtoupper( $checkstr ) === $uc_specialword )
        print "Found '$checkstr' in '$matchstr'\n";
}
?>
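For comparison, the same capture, strip, and compare idea can be sketched in Python. This is only an illustration under a few assumptions: the tag pattern is simplified and does not handle attributes, only the three separators named in the question are stripped (the code above also strips the comma), and the names TAG, STRIP and find_word are made up for this sketch.

```python
import re

# Simplified tag matcher (assumption: no attributes); group 1 is the tag name,
# group 2 is the text between the opening tag and its matching closing tag.
TAG = re.compile(r'<([A-Za-z_:][\w:.-]*)\s*>([^<>]+)</\1\s*>')

# The three separators named in the question; add ',' to mirror the code above.
STRIP = re.compile(r'["*`]')


def find_word(text, word):
    """Return (cleaned_text, full_match) pairs whose cleaned text equals word."""
    hits = []
    for m in TAG.finditer(text):
        cleaned = STRIP.sub("", m.group(2))  # drop the separator characters
        if cleaned.casefold() == word.casefold():  # case-insensitive compare
            hits.append((cleaned, m.group(0)))
    return hits


sample = '''
<hw>Al"pha*bet</hw>
<hw>Al"pha*be`t</hw>
<hw>alphabet</hw>
<hw>al*pha*bet</hw>
<hw>al"pha"b"et</hw>
'''

hits = find_word(sample, "alphabet")
for cleaned, matched in hits:
    print(f"Found '{cleaned}' in '{matched}'")
```

Because the comparison happens after stripping, the regex itself does not need to know where inside the word the separators may appear.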
Answer 1 (score: 1)
You can try this:
a([`"\*])*l([`"\*])*p([`"\*])*h([`"\*])*a([`"\*])*b([`"\*])*e([`"\*])*t
Or this:
>\s*a([`"\*])*l([`"\*])*p([`"\*])*h([`"\*])*a([`"\*])*b([`"\*])*e([`"\*])*t\s*<
Edit
Sorry, I forgot to escape the *.
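Writing the interleaved pattern by hand is error-prone, so it can also be generated from the word. The following is a minimal Python sketch of that idea, not part of the original answer; build_pattern and SEPARATORS are hypothetical names, and the separator set is assumed to be the three characters from the question.

```python
import re

SEPARATORS = '"*`'  # assumption: the three separator characters from the question


def build_pattern(word):
    """Build a regex allowing any run of separators between consecutive letters."""
    sep = "[" + re.escape(SEPARATORS) + "]*"
    body = sep.join(re.escape(ch) for ch in word)
    # Anchor between '>' and '<' as in the second pattern above
    return re.compile(r">\s*" + body + r"\s*<", re.IGNORECASE)


samples = [
    '<hw>Al"pha*bet</hw>',
    '<hw>Al"pha*be`t</hw>',
    '<hw>alphabet</hw>',
    '<hw>al*pha*bet</hw>',
    '<hw>al"pha"b"et</hw>',
]

pattern = build_pattern("alphabet")
matched = [s for s in samples if pattern.search(s)]
print(len(matched), "of", len(samples), "samples matched")
```

As with the hand-written version, separators are only allowed between letters, not before the first letter or after the last one.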
Answer 2 (score: 0)
Here is one I came up with that works for the cases you listed:
/<[a-zA-Z]+>al"*\**pha\**\"*b\"*e`*t<\/[a-zA-Z]+>/i
Check out http://www.rubular.com/. It offers live testing of regular expressions.
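A quick way to check a pattern like this is to run it over the listed cases plus a few variations. The Python sketch below does that with the regex copied from this answer; the two extra inputs are made up for illustration and are not from the question.

```python
import re

# The regex from this answer, with the /i flag expressed as re.IGNORECASE
answer_regex = re.compile(
    r'<[a-zA-Z]+>al"*\**pha\**\"*b\"*e`*t<\/[a-zA-Z]+>', re.IGNORECASE
)

samples = [
    '<hw>Al"pha*bet</hw>',
    '<hw>Al"pha*be`t</hw>',
    '<hw>alphabet</hw>',
    '<hw>al*pha*bet</hw>',
    '<hw>al"pha"b"et</hw>',
]

results = {s: bool(answer_regex.search(s)) for s in samples}

# Hypothetical extra inputs showing the limits of a hand-placed pattern:
# a separator in a position the pattern does not anticipate is rejected,
# and the closing tag is not required to match the opening tag.
separator_elsewhere = bool(answer_regex.search('<hw>a*lphabet</hw>'))
mismatched_tags = bool(answer_regex.search('<hw>alphabet</xx>'))

for text, ok in results.items():
    print(ok, text)
print(separator_elsewhere, mismatched_tags)
```

The pattern covers the five listed cases, but it only accepts separators at the positions where they were written into the regex.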