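I'm trying to use a regular expression to match the first single node in an XML document:
~<(\S+).*>.*</\1>~
It doesn't match anything until the text is below a certain length. With one document, the regex only found something after I had trimmed the text down to 1186 characters. In the example below, I trimmed the text until it was only 960 characters long, and then the regex succeeded. As you can imagine, this seemingly inconsistent behavior is very confusing. I would appreciate any insight into why this happens.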
Original text:
<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Corets, Eva</author> <title>Oberon's Legacy</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-03-10</publish_date> <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description> </book> <book id="bk105"> <author>Corets, Eva</author> <title>The Sundered Grail</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2001-09-10</publish_date> <description>The two daughters of Maeve, half-sisters, battle one another for control of England. 
Sequel to Oberon's Legacy.</description> </book> <book id="bk106"> <author>Randall, Cynthia</author> <title>Lover Birds</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-09-02</publish_date> <description>When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.</description> </book> <book id="bk107"> <author>Thurman, Paula</author> <title>Splish Splash</title> <genre>Romance</genre> <price>4.95</price> <publish_date>2000-11-02</publish_date> <description>A deep sea diver finds true love twenty thousand leagues beneath the sea.</description> </book> <book id="bk108"> <author>Knorr, Stefan</author> <title>Creepy Crawlies</title> <genre>Horror</genre> <price>4.95</price> <publish_date>2000-12-06</publish_date> <description>An anthology of horror stories about roaches, centipedes, scorpions and other insects.</description> </book> <book id="bk109"> <author>Kress, Peter</author> <title>Paradox Lost</title> <genre>Science Fiction</genre> <price>6.95</price> <publish_date>2000-11-02</publish_date> <description>After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.</description> </book> <book id="bk110"> <author>O'Brien, Tim</author> <title>Microsoft .NET: The Programming Bible</title> <genre>Computer</genre> <price>36.95</price> <publish_date>2000-12-09</publish_date> <description>Microsoft's .NET initiative is explored in detail in this deep programmer's reference.</description> </book> <book id="bk111"> <author>O'Brien, Tim</author> <title>MSXML3: A Comprehensive Guide</title> <genre>Computer</genre> <price>36.95</price> <publish_date>2000-12-01</publish_date> <description>The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.</description> </book> <book id="bk112"> <author>Galos, Mike</author> <title>Visual Studio 7: A Comprehensive Guide</title> <genre>Computer</genre> <price>49.95</price> 
<publish_date>2001-04-16</publish_date> <description>Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.</description> </book> </catalog>
Trimmed (successful) text:
<?xml version="1.0"?> <catalog> <book id="bk101"> <author>Gambardella, Matthew</author> <title>XML Developer's Guide</title> <genre>Computer</genre> <price>44.95</price> <publish_date>2000-10-01</publish_date> <description>An in-depth look at creating applications with XML.</description> </book> <book id="bk102"> <author>Ralls, Kim</author> <title>Midnight Rain</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-12-16</publish_date> <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description> </book> <book id="bk103"> <author>Corets, Eva</author> <title>Maeve Ascendant</title> <genre>Fantasy</genre> <price>5.95</price> <publish_date>2000-11-17</publish_date> <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description> </book> <book id="bk104"> <author>Core</catalog>
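I apologize for the formatting of the text, but I didn't want to add anything to the data (such as newline characters) that might make it behave differently for other people.
Edit: I have been using this website to test the regex.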
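Answer 0: (score: 2)
The function preg_match(), like many other PHP functions, has a return value. Depending on that return value, you can decide how your script should continue.
In the case you ran into, you did not check for a return value of FALSE, and, as your example shows, FALSE is exactly what it returns.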
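The manual states that a return value of FALSE signals an error. You can find out more about that error by calling preg_last_error(), which returns the last error code. That way you can learn about the error your call to preg_match() gives:
int(2) - PREG_BACKTRACK_LIMIT_ERROR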
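The check described above could look roughly like the following. This is a minimal sketch, not the answerer's own code: the helper name match_first_node() and the shape of the array it returns are assumptions made for illustration, while preg_match(), preg_last_error() and the PREG_* constants are the real PHP functions and constants.

```php
<?php
// Hypothetical helper (name and return shape are illustrative assumptions):
// runs preg_match() and keeps "no match" (0) apart from "error" (false).
function match_first_node(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        // An error occurred; preg_last_error() returns the PREG_* error code.
        return ['status' => 'error', 'code' => preg_last_error(), 'match' => null];
    }
    if ($result === 0) {
        // The pattern ran to completion but found nothing.
        return ['status' => 'no_match', 'code' => PREG_NO_ERROR, 'match' => null];
    }
    return ['status' => 'match', 'code' => PREG_NO_ERROR, 'match' => $matches[0]];
}

$outcome = match_first_node('~<(\S+).*>.*</\1>~', '<a>text</a>');

if ($outcome['status'] === 'error' && $outcome['code'] === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "The backtrack limit was exceeded\n";
} else {
    echo $outcome['status'], "\n";
}
```

With a short subject like the one here the pattern matches; with the long catalog text from the question, the same call is where the error status would surface. The limit itself is governed by the pcre.backtrack_limit ini setting, but rewriting the pattern to backtrack less is usually the more robust fix.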
Answer 1: (score: 1)
You can get better control over the quantifiers by using restricted character classes:
Example with a lazy quantifier:
$pattern = '~<([^>\s]++)[^>]*+>.*?</\1>~';
Example with only possessive quantifiers (better):
$pattern = '~<([^>\s]++)[^>]*+>(?>[^<]++|<(?!/\1>))+</\1>~';
However, neither of these two patterns handles nested structures. For that you must use:
$pattern = '~<([^>/\s]++)[^>]*+>(?>[^<]++|(?R))*</\1>~';
The second pattern: (?>[^<]++|<(?!/\1>))+
(?> # open an atomic group
[^<]++ # all characters but < one or more times (possessive)
| # OR
<(?!/\1>) # < not followed by /, the content of the first backreference
# (the tag name here) and >
)+ # close the atomic group and repeat one or more times
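The goal of this part is to match everything before </\1>. It does so by matching every character that is not <, or any < that is not followed by /tagname>.
More information about possessive quantifiers and atomic groups.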
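To try the second pattern out, it can be written with the x (extended) modifier so the comments live inside the pattern itself. This is only a sketch: the sample subject is a shortened version of the catalog from the question, and the variable names are illustrative.

```php
<?php
// The second pattern with the x modifier: whitespace and # comments
// inside the pattern are ignored by PCRE.
$pattern = '~
    <([^>\s]++)      # opening <, then the tag name (capture group 1)
    [^>]*+           # the rest of the opening tag (attributes)
    >
    (?>              # open an atomic group
        [^<]++       # all characters but <, one or more times (possessive)
      |              # OR
        <(?!/\1>)    # a < that does not start the closing tag of group 1
    )+               # close the atomic group, repeat one or more times
    </\1>            # the closing tag
~x';

// Shortened sample, kept on one line like the data in the question.
$subject = '<?xml version="1.0"?> <catalog> <book id="bk101"> '
         . '<author>Gambardella, Matthew</author> </book> </catalog>';

if (preg_match($pattern, $subject, $m) === 1) {
    echo "Tag name: ", $m[1], "\n";
    echo "Node:     ", $m[0], "\n";
}
```

The XML declaration has no closing tag, so the attempt that starts there fails and the first element that matches is catalog. Because each step is possessive or atomic, a failed attempt is abandoned quickly instead of retrying every possible split of the text.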
The third pattern: the recursive pattern
<
([^>/\s]++) # tagname,
# note that you must exclude the / to avoid closing tags
[^>]*+ # leading characters in the tag
>
(?> # open an atomic group
[^<]++ # all characters but <, one or more times (possessive)
| # OR
(?R) # repeat the whole pattern
)* # close the atomic group, repeat zero or more times
</\1> # close tag with the first back reference
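The difference between the second and third patterns shows up when a tag contains another tag with the same name. The following is a small sketch; the sample subject is made up for illustration and is not from the question.

```php
<?php
// Made-up sample with a <div> nested inside another <div>.
$subject = '<div> a <div> b </div> c </div> <p>x</p>';

// Second pattern: stops at the first closing tag that has the right name.
$flat = '~<([^>\s]++)[^>]*+>(?>[^<]++|<(?!/\1>))+</\1>~';

// Third pattern: (?R) re-enters the whole pattern for every nested element.
$recursive = '~<([^>/\s]++)[^>]*+>(?>[^<]++|(?R))*</\1>~';

preg_match($flat, $subject, $flatMatch);
preg_match($recursive, $subject, $recursiveMatch);

echo "flat:      ", $flatMatch[0], "\n";
echo "recursive: ", $recursiveMatch[0], "\n";
```

The flat pattern ends its match at the inner closing tag, while the recursive pattern consumes the inner element as a unit and only then looks for the outer closing tag. Note that neither pattern, as written, accounts for self-closing tags, comments or CDATA sections.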
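Answer 2: (score: -1)
Well, first of all, the general consensus is that you shouldn't parse XML with regular expressions. Use SimpleXML if possible. And as Nick said, the pattern is too greedy...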
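For comparison, here is a minimal sketch of the SimpleXML route. It assumes the SimpleXML extension is available, uses a shortened version of the catalog from the question, and the variable names are illustrative.

```php
<?php
// Shortened version of the catalog from the question.
$xml = '<?xml version="1.0"?> <catalog> '
     . '<book id="bk101"> <author>Gambardella, Matthew</author> </book> '
     . '<book id="bk102"> <author>Ralls, Kim</author> </book> </catalog>';

// simplexml_load_string() returns false if the document cannot be parsed.
$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    exit("Could not parse the XML\n");
}

// The root element, and the first <book> child inside it.
$firstBook = $catalog->book[0];

echo "Root:   ", $catalog->getName(), "\n";
echo "Id:     ", (string) $firstBook['id'], "\n";
echo "Author: ", (string) $firstBook->author, "\n";
```

A parser deals with nesting, attributes and the XML declaration on its own, and its running time does not depend on how a pattern happens to backtrack.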