Here is a sample of the structure of my RSS file:
<item>
<title>My Title</title>
<link>http://www.link.com</link>
<description>The description</description>
<author>Blah Blah</author>
<pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
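How can I use PHP regular expressions to completely remove the author tag and its content, the entire media:content tag and its content, and the position tag and its content from the file?
Thanks!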
Answer 0: (score: 3)
Don't use regex to parse HTML/XML; there are very good parsers available for that:
<?php
$xml = <<<XML
<item>
<title>My Title</title>
<link>http://www.link.com</link>
<description>The description</description>
<author>Blah Blah</author>
<pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
XML;
$dom = new DOMDocument();
//DOMDocument throws warnings when the XML is invalid, we don't care.
//Though in this case, the media: namespace would be ignored because it's not defined.
@$dom->loadXML($xml);
$document = $dom->documentElement;
//Find the elements you want to remove, and remove them.
//The undeclared media: prefix may or may not be kept in the element name,
//so both "content" and "media:content" are tried.
foreach (array("author", "position", "content", "media:content") as $name) {
    $node = $document->getElementsByTagName($name)->item(0);
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }
}
//Output the resulting XML.
echo $dom->saveXML();
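Answer 1: (score: 1)
My previous answer was, rightly, deleted; I should have added it as a comment. Here is an alternative that does exactly what you are asking for with DOMDocument: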
<?php
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>bla</title>
<link>bla</link>
<description>A description</description>
<language>en-us</language>
<item xmlns:media="http://search.yahoo.com/mrss/">
<title>My Title</title>
<link>http://www.link.com</link>
<description>The description</description>
<author>Blah Blah</author>
<pubDate>Thu, 26 Jul 2012 10:17:15 -0400</pubDate>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
</channel>
</rss>
XML;
$doc = new DOMDocument();
$doc->loadXml( $xml );
foreach( $doc->getElementsByTagName( 'item' ) as $item ) {
    $item->removeChild( $item->getElementsByTagName( 'author' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'position' )->item( 0 ) );
    $item->removeChild( $item->getElementsByTagName( 'content' )->item( 0 ) );
}
var_dump( $doc->saveXml( ) );
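If an item might lack one of these elements, or contain more than one of them, the loop above can be generalized. What follows is only a sketch: the helper name removeElements is made up for illustration (it is not part of DOMDocument), and it assumes the media prefix is bound to http://search.yahoo.com/mrss/ as in the sample feed above.

```php
<?php
// Hypothetical helper (not part of DOMDocument): removes every element with
// the given local name, optionally restricted to one namespace URI.
// Returns the number of elements removed.
function removeElements(DOMDocument $doc, $localName, $namespaceUri = null)
{
    $nodes = $namespaceUri === null
        ? $doc->getElementsByTagName($localName)
        : $doc->getElementsByTagNameNS($namespaceUri, $localName);

    // Copy the live node list first; removing nodes while iterating over
    // the live list would skip elements.
    $removed = 0;
    foreach (iterator_to_array($nodes) as $node) {
        $node->parentNode->removeChild($node);
        $removed++;
    }
    return $removed;
}

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<item>
<title>My Title</title>
<author>Blah Blah</author>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
</channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

removeElements($doc, 'author');
removeElements($doc, 'position');
// Match media:content by namespace URI rather than by prefix.
removeElements($doc, 'content', 'http://search.yahoo.com/mrss/');

$result = $doc->saveXML();
echo $result;
```

Matching on the namespace URI keeps this working even if a feed binds the same namespace to a different prefix, and a missing element simply results in nothing being removed instead of an error.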
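Answer 2: (score: 0)
Disclaimer: for flexibility and reliability, you should always use a proper parser (such as DOMDocument) to manipulate XML/HTML. That said, if you are sure your markup is well-formed, its structure is not going to change, and it contains no nested tags of the same name, regular expressions can do the job in a case like this. But you should only use them if you know what you are doing.
You need to use preg_replace() to replace each match with an empty string (""). Here is how it is done for the <author>...</author> block: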
$markup = preg_replace('#<author>(.*?)</author>#is', '', $markup);
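Basically, this matches the opening tag <author>, anything (or nothing) between the opening and closing tags, and the closing tag </author>.
The other tags can be removed in a similar way.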
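For instance, all three elements from the question could be handled as below. This is a sketch under the same assumptions as the disclaimer (well-formed markup, no nested tags of the same name). The `\b[^>]*` part is an addition to the pattern above, so that the opening <media:content> tag still matches despite its url attribute.

```php
<?php
$markup = <<<XML
<item>
<title>My Title</title>
<author>Blah Blah</author>
<media:content url="myimage.jpg">
<media:title>sdafsd</media:title>
</media:content>
<position>1</position>
</item>
XML;

// <author> and <position> carry no attributes here, so the simple pattern is enough.
$markup = preg_replace('#<author>(.*?)</author>#is', '', $markup);
$markup = preg_replace('#<position>(.*?)</position>#is', '', $markup);

// <media:content> has a url attribute, so allow anything up to the
// closing ">" of the opening tag.
$markup = preg_replace('#<media:content\b[^>]*>(.*?)</media:content>#is', '', $markup);

echo $markup;
```

Note that only the matched elements are replaced, so the line breaks that followed them stay behind as empty lines in the output.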
Answer 3: (score: 0)
// $file_name is assumed to hold the path to the RSS file.
$content = file_get_contents($file_name);
// Remove each element, with or without attributes, together with its content.
foreach (array('author', 'media:content', 'position') as $xmlElem) {
    $tag = preg_quote($xmlElem, '#');
    $content = preg_replace('#<' . $tag . '(?:\s+[^>]+)?>(.*?)</' . $tag . '>#s', '', $content);
}