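I need to split a long string into an array with the following constraints: the string must not be split inside an HTML tag (for example, <a href='test.html'> must not be split into <a href='test and html'>). In other words, HTML tags should stay intact. However, an opening tag and its closing tag may end up in different segments/chunks. I think a regular expression with preg_split can do this. Please help me with the right RegEx. Solutions that don't use a regular expression are welcome too.
Thanks
Sadi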
Answer 0 (score: 1)
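Ideally you would split on every dot that is not inside a tag, using a negative lookbehind:
$parts = preg_split("/(?<!<[^>]*)\./", $input);
However, PHP does not allow variable-length lookbehind, so this doesn't work. Apparently the only two engines that support it are JGsoft and the .NET regexp engine. Useful Page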
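One workaround that stays within PCRE, offered as a sketch rather than a drop-in answer: match whole tags first and discard them with the backtracking verbs (*SKIP)(*FAIL), so that only dots outside tags remain as split points. The helper name splitOutsideTags is hypothetical, and the pattern assumes that no attribute value contains a literal > character.

```php
<?php
// Hypothetical helper: split on "." except where the dot sits inside an HTML tag.
function splitOutsideTags(string $input): array {
    // The first alternative consumes a whole tag and then forces a failure;
    // (*SKIP) stops the engine from retrying at positions inside that tag,
    // so only dots outside tags can match the second alternative.
    return preg_split('/<[^>]*>(*SKIP)(*FAIL)|\./', $input);
}

$parts = splitOutsideTags('one. two <a href="a.b.c">tag</a>. three');
print_r($parts);
```

This only finds the split points; merging the pieces up to a maximum length still has to be done separately.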
The way I would handle this is:
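function splitStringUp($input, $maxlen) {
    // Split on every dot first, then glue pieces back together where needed.
    $parts = explode(".", $input);
    $i = 0;
    while ($i < count($parts)) {
        // Guard: without a following piece there is nothing to merge with.
        $hasNext = $i < count($parts) - 1;
        if ($hasNext && preg_match("/<[^>]*$/", $parts[$i])) {
            // The piece ends inside an unclosed tag: always rejoin it with the next piece.
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i+1]);
        } elseif ($hasNext && strlen($parts[$i] . "." . $parts[$i+1]) < $maxlen) {
            // The combined piece is still shorter than $maxlen: merge the two.
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i+1]);
        } else {
            $i++;
        }
    }
    return $parts;
}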
You didn't say what should happen when a single sentence is longer than 8000 characters, so this simply leaves such sentences intact.
Example output:
splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 8000);
array(1) {
[0]=> string(114) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag. and the closing tag</a>. hooray"
}
splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 80);
array(2) {
[0]=> string(81) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag"
[1]=> string(32) " and the closing tag</a>. hooray"
}
splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 40);
array(4) {
[0]=> string(18) "this is a sentence"
[1]=> string(25) " this is another sentence"
[2]=> string(36) " this is an html <a href="a.b.c">tag"
[3]=> string(32) " and the closing tag</a>. hooray"
}
splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 0);
array(5) {
[0]=> string(18) "this is a sentence"
[1]=> string(25) " this is another sentence"
[2]=> string(36) " this is an html <a href="a.b.c">tag"
[3]=> string(24) " and the closing tag</a>"
[4]=> string(7) " hooray"
}
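As the 81-character chunk in the second example shows, a piece that is rejoined to keep a tag intact can end up longer than the requested limit, just like an over-long sentence. If the limit is a hard one, the caller may want to detect such pieces. A minimal sketch, assuming a hypothetical helper named oversizedChunks that is not part of the answer above:

```php
<?php
// Hypothetical helper: return the chunks (keeping their original indexes)
// whose byte length exceeds $maxlen.
function oversizedChunks(array $parts, int $maxlen): array {
    return array_filter($parts, function ($part) use ($maxlen) {
        return strlen($part) > $maxlen;
    });
}

// The two chunks from the example with a limit of 80.
$parts = [
    'this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag',
    ' and the closing tag</a>. hooray',
];
foreach (oversizedChunks($parts, 80) as $index => $part) {
    echo "chunk $index is " . strlen($part) . " bytes\n";
}
```

What to do with an oversized chunk (reject it, or split it on some other boundary) depends on requirements the question does not state.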
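Answer 1 (score: 0)
Unfortunately, HTML is not a regular language, which means you cannot parse it with a single regular expression. On the other hand, if the input always looks similar, or you only need to parse certain parts of it, that is not a problem. Iterating over the matches of this regular expression yields the element names and their contents: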
'~<(?P<element>\w+)(?P<attributes>[^>]*)>(?:(?P<content>.*?)</\w+>)?~'
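Iterating over the matches could look like the sketch below. It assumes element names are matched with \w+ (word characters) and uses preg_match_all with PREG_SET_ORDER; the sample input and variable names are illustrative only.

```php
<?php
// Assumption: element names are matched with \w+ (word characters).
$pattern = '~<(?P<element>\w+)(?P<attributes>[^>]*)>(?:(?P<content>.*?)</\w+>)?~';
$html = 'this is an html <a href="a.b.c">tag. and the closing tag</a>. hooray';

// PREG_SET_ORDER gives one array per match, with the named groups as keys.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    // The content group is optional, so it may be absent for unclosed tags.
    $content = isset($m['content']) ? $m['content'] : '';
    echo $m['element'], ' | ', trim($m['attributes']), ' | ', $content, "\n";
}
```

Because the content group is lazy and the closing tag is not tied to the opening name, nested elements are cut at the first closing tag, which is the kind of limitation the answer warns about.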