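I have a database full of horrible computer-generated HTML, littered with all kinds of styling information... style attributes, font tags, background attributes...
I have to redesign the website, but first I need to strip all the styling from the product descriptions. Before anyone suggests doing it by hand: there are 100,000 products. I think some creative regular expressions in PHP might do the trick.
Ideally I would strip all the HTML and keep only plain text, but the descriptions contain table after table... so that would only end in tears.
Looking forward to your creative solutions :)
Edit -
On second thought, I could also do this in VBA, since I can export the descriptions to an Excel sheet. So a PHP or VBA solution would be great.
Edit -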
<div class="XXXX-template-06">
<table border="0" cellpadding="0" cellspacing="0" style="border-collapse: collapse" bordercolor="#111111" width="694" id="AutoNumber1">
<tbody><tr>
<td width="516" height="18" bgcolor="#999966" align="center">
<p align="center"><font face="Verdana" color="#FFFFFF"><b>Mont Blanc Scott Roof mounted cycle bike carrier<br>
<br>
Part Number: 728540</b></font></p></td>
<td width="178" height="18" bgcolor="#999966" align="center">
<a href="/shippingcalculator.html?SKU=728540" target="_blank"><img border="0" src="http://images.ZZZZpro.com/2145/" width="88" height="33"></a></td>
</tr>
<tr>
<td width="694" height="57" bgcolor="#CCCC99" align="center" colspan="2">
<b><font face="Verdana" size="2" class="CustomStyle-CycleCarrier">
<script type="text/javascript">
<!--function click() { if (event.button==2) { alert('All graphics, descriptions and other information, including the HTML code of this listing are the property of XXXX Limited and may not be reproduced in any form without the express permission of XXXX Limited. Email us: sales@XXXX.com'); } } document.onmousedown=click // -->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!----> -->
</script>
<div align="center">
<center>
<table height="336" background="http://images.ZZZZpro.com/2145/I/21/fade1.jpg" width="680" border="0">
<tbody><tr>
<td height="49" width="136"><p align="center"><img height="62" src="http://XXXXbiz.ipage.com/XXXX/Images/Mont%20Blanc/montblanc.jpg" width="165" border="0"></p></td>
<td height="49" width="378"><p align="center"><font face="Verdana" color="#0000ff" size="5"><u><strong>Mont Blanc </strong></u></font><u><strong><font face="Verdana" color="#0000FF" size="5">Scott Roof Bar Rack 1 Cycle Carrier</font></strong></u></p></td>
<td height="49" width="146"><img height="69" src="http://images.ZZZZpro.com/2145/I/20/logomed.gif" width="174" border="0"></td>
</tr>
<tr>
<td height="241" colspan="3" width="672"><hr><p align="center"><img height="223" src="http://XXXXbiz.ipage.com/XXXX/Images/Mont%20Blanc/scottlrg.jpg" width="237" border="0"></p><p><font color="black"><b>Scott</b> </font></p><ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive oval carrying bar.<br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></li><li>Extra wide wheel holders take the fattest tyres<br></li><li>Strong Webbing straps fasten wheels securely to carrier<br></li><li><font size="3" color="black">Upright, roof bar mounted, locking cycle carrier<br></font></li><li><font size="3" color="black"> Locks to roof rails and locks bikes<br></font></li><li><font size="3" color="black"> Quick and easy to use<br></font></li><li><font size="3" color="black">Adjustable for most cycle styles</font></li></ul><center><table cellspacing="0" width="100%" cellpadding="20" border="0" height="1" class="featuretable">
<tbody><tr>
<td height="55" class="featuretd" width="110"><p align="center"><a target="_blank" href="http://www.montblancuk.co.uk/support/inst/scott.pdf"><img width="20" alt="Open document" src="http://espimages.biz/2145/I/20/mount_link.gif" border="0" height="20"></a></p></td>
<td height="55" class="featuretd">To view Fitting Instructions in PDF format please click the spanner</td>
</tr>
</tbody></table>
<table height="317">
<tbody><tr class="technicaltr" valign="top">
<td height="1" class="technicalfirstcolumn"><font class="technicalheader">Technical data</font></td>
<td height="1" class="technicalsecondcolumn"><p><font class="heading1">Mont </font>Blanc Scott</p><p align="center"><img height="107" src="http://XXXXbiz.ipage.com/XXXX/Images/Mont%20Blanc/scottfaint.jpg" width="127" border="0"></p></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Max number of bikes</div></td>
<td height="21" class="technicalsecondcolumn"><div>1</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="18" class="technicalfirstcolumn"><div>Load capacity (kg)</div></td>
<td height="18" class="technicalsecondcolumn"><div>15 KG</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Weight (kg)</div></td>
<td height="21" class="technicalsecondcolumn"><div>2.2KG</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Fits frame-dimensions (mm)</div></td>
<td height="21" class="technicalsecondcolumn">Up to 80mm</td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Fits wheel-dimensions</div></td>
<td height="21" class="technicalsecondcolumn"><div>All</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Locks bikes to carrier</div></td>
<td height="21" class="technicalsecondcolumn"><div>Yes</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Locks carrier to car</div></td>
<td height="21" class="technicalsecondcolumn"><div>Yes</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Tilt function, with bikes</div></td>
<td height="21" class="technicalsecondcolumn"><div>NA</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>TÜV/EuroBE approved</div></td>
<td height="21" class="technicalsecondcolumn"><div>NA</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Fullfills City Crash norms</div></td>
<td height="21" class="technicalsecondcolumn"><div>NA</div></td>
</tr>
<tr class="technicaltr" valign="top">
<td height="21" class="technicalfirstcolumn"><div>Miscellaneous</div></td>
<td height="21" class="technicalsecondcolumn"><div><p>Fits all types of Roof Bars,</p></div></td>
</tr>
</tbody></table>
<p align="center">
<font size="2" face="Verdana">The cycle carrier is
guaranteed for Five year from date of purchase.
<br>
<br>We stock a wide range of towbars and towing accessories.
<a href="mailto:sales@XXXX.com?subject=Witter ZX88 Cycle Carrier"><br>Click
here to email us</a> if you require details of our other
towing equipment.</font>
</p>
<hr>
</center>
</td>
</tr>
</tbody></table>
</center>
</div>
<br>
Please note that with the Type of cycle carrier where you mount it
<br>
onto a flange ball you may need the long reach ball which will <br>
allow you enough clearance from the bumper</font></b></td>
</tr>
<tr>
<td width="694" height="57" bgcolor="#CCCC99" align="center" colspan="2">
<a href="http://www.XXXXeuro.ZZZZprostorefront.co.uk/products/728540-mont-blanc-scott-roof-mounted-cycle-bike-carrier-728540.html" target="_blank"><img border="0" src="http://images.ZZZZpro.com/2145/" width="55" height="40"></a>
<b><font face="Verdana" size="2">Not from the UK ? Click the flag
to purchase this item from our EU site </font></b><a href="http://www.XXXXeuro.ZZZZprostorefront.co.uk/products/728540-mont-blanc-scott-roof-mounted-cycle-bike-carrier-728540.html" target="_blank"><img border="0" src="http://images.ZZZZpro.com/2145/" width="57" height="40"></a></td>
</tr>
</tbody></table>
</div>
Edit -
Looking closer, I think I need to get rid of the following:
Attributes: style, bgcolor, background
Tags: font
Answer 0 (score: 3)
I would suggest using XSLT to strip out everything you don't want. A simple identity template would be a good starting point.
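As a rough illustration of that idea in PHP, the sketch below starts from an identity template and adds overrides for the attributes and tags named in the question. It assumes the `dom` and `xsl` extensions are available; the function name `cleanWithXslt` and the sample markup are made up for the example, and real descriptions may need more rules.

```php
<?php
// Sketch only: identity template plus a few overrides, run through PHP's XSLTProcessor.
$xslt = <<<'XSL'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="UTF-8"/>
  <!-- identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- drop the presentational attributes named in the question -->
  <xsl:template match="@style|@bgcolor|@background"/>
  <!-- unwrap font: keep its children, drop the tag itself -->
  <xsl:template match="font"><xsl:apply-templates select="node()"/></xsl:template>
  <!-- remove script together with its content -->
  <xsl:template match="script"/>
</xsl:stylesheet>
XSL;

// Hypothetical helper: parse messy HTML, apply the stylesheet, return the result.
function cleanWithXslt(string $html, string $xslt): string
{
    $source = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about legacy markup
    $source->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $stylesheet = new DOMDocument();
    $stylesheet->loadXML($xslt);

    $processor = new XSLTProcessor();
    $processor->importStylesheet($stylesheet);

    return (string) $processor->transformToXml($source);
}

// Made-up sample in the style of the question's markup.
$sample = '<table><tr><td bgcolor="#999966"><font face="Verdana"><b>Scott</b></font>'
        . '<script>alert(1)</script></td></tr></table>';
$result = cleanWithXslt($sample, $xslt);
echo $result, PHP_EOL;
```

Because `loadHTML` produces elements without a namespace, the templates match plain names such as `font`; XHTML produced by Tidy would need the XHTML namespace prefix instead.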
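Answer 1 (score: 0)
How about PHP's strip_tags function?
The annoying part is that you have to pass in every tag you want to keep (as a string such as '<table><tr><td>', or as an array since PHP 7.4), but you only have to write that list once.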
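A minimal sketch of that approach follows. The whitelist is only a guess at what a redesign might want to keep, and the sample string is made up.

```php
<?php
// Hypothetical whitelist: adjust it to the structural tags you want to keep.
$allowed = '<table><tbody><tr><td><ul><li><br><hr>';

// Made-up sample in the style of the question's markup.
$html = '<td bgcolor="#CCCC99"><font face="Verdana"><b>Scott</b></font><br></td>';

// strip_tags() removes every tag that is not whitelisted but keeps the text inside it.
$clean = strip_tags($html, $allowed);

echo $clean, PHP_EOL; // prints: <td bgcolor="#CCCC99">Scott<br></td>
```

Note that the attributes of whitelisted tags survive (bgcolor is still there), and that the text inside removed tags is kept, which includes the body of a script block unless it is wrapped in an HTML comment.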
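For removing tag attributes such as bgcolor, someone wrote this function here, which is worth a look, but beware of the dodgy curly double quotes on that page. There is a link at the bottom to download the code without the WordPress formatting.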
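If the linked function is not available, the same effect can probably be had with PHP's built-in DOM extension. The sketch below is not that function; `stripStyling` and the `strip-root` wrapper id are names invented for the example, and it assumes the stored descriptions are UTF-8.

```php
<?php
// Sketch: remove presentational attributes and unwrap <font> tags with DOMDocument.
function stripStyling(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about legacy markup
    // The wrapper div gives one known root to serialise from afterwards;
    // the XML encoding hint makes libxml read the input as UTF-8 (assumption).
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="strip-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Drop the attributes listed in the question.
    $attrs = iterator_to_array($xpath->query('//@style | //@bgcolor | //@background'));
    foreach ($attrs as $attr) {
        $attr->ownerElement->removeAttributeNode($attr);
    }

    // Replace each <font> element by its children.
    $fonts = iterator_to_array($xpath->query('//font'));
    foreach ($fonts as $font) {
        while ($font->firstChild) {
            $font->parentNode->insertBefore($font->firstChild, $font);
        }
        $font->parentNode->removeChild($font);
    }

    // Serialise only what was inside the wrapper div.
    $root = $xpath->query('//div[@id="strip-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Made-up sample.
$cleaned = stripStyling('<p style="color:red"><font face="Verdana">Hi</font> there</p>');
echo $cleaned, PHP_EOL;
```

Working on a parsed tree rather than on the raw string avoids the usual regex problems with attribute order and quoting, at the cost of the parser normalising the markup.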
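Answer 2 (score: 0)
Thanks to @Paul for the idea; here is an example in Excel. It is very rough and will need modifying depending on how you store the HTML in Excel, but hopefully it will get you started.
This example assumes a few things:
You have installed the TidyATL COM object first (click the link that says 'wrapper'; on 64-bit Windows 7 you may need to copy the DLL to C:\Windows\SysWOW64 first and then register it by running regsvr32 C:\Windows\SysWOW64\TidyATL.dll).
Your Excel project references the Microsoft XML 6.0 and Tidy 1.0 type libraries.
Your HTML is stored in cell A1 of Sheet1. The result is placed in cell B1. You could easily extend this idea to iterate over all used cells in a column and process all the HTML in one go.
I have no experience writing XSLT. I lifted the 'identity template' straight from here. I had never used XSLT before today, so perhaps someone who knows it can edit the XSLT to strip out the <font> nodes. This example only strips all attributes.
It uses HTML Tidy to convert the ugly HTML to XHTML and then applies the XSLT template to the result.
Edit: Sorry, I got the 'match' attribute in the XSLT wrong. It was match='@*|node()'; it should be match='node()'.
Here is the code I used: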
Sub TidyUp()
    Dim t As TidyATL.TidyDocument
    Dim sXSLT As String

    ' Stylesheet: copies every node but never selects attributes,
    ' so all attributes are dropped from the output
    sXSLT = "<?xml version='1.0' encoding='ISO-8859-1'?>" & _
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" & _
        "<xsl:template match='node()'>" & _
        " <xsl:copy>" & _
        " <xsl:apply-templates select='node()'/>" & _
        " </xsl:copy>" & _
        "</xsl:template>" & _
        "</xsl:stylesheet>"

    ' Let Tidy turn the messy HTML into well-formed XHTML
    Set t = New TidyATL.TidyDocument
    t.ParseString Sheet1.Range("A1").Value
    t.SetOptBool TidyXmlOut, True
    t.SetOptBool TidyXhtmlOut, True
    t.SetOptBool TidyNumEntities, True
    t.SetOptBool TidyXmlDecl, True
    t.CleanAndRepair

    ' Note: depending on the MSXML reference, the versioned class names
    ' (DOMDocument60, FreeThreadedDOMDocument60, XSLTemplate60) may be needed
    Dim x As MSXML2.DOMDocument
    Dim x2 As MSXML2.FreeThreadedDOMDocument
    Dim xe As MSXML2.IXMLDOMParseError
    Set x = New MSXML2.DOMDocument
    Set x2 = New MSXML2.FreeThreadedDOMDocument

    ' Load XHTML into a DOM
    x.LoadXML t.SaveString
    Set xe = x.parseError
    If xe.ErrorCode <> 0 Then
        MsgBox "Err: " & xe.reason
        Exit Sub
    End If

    ' Load XSLT into a DOM
    x2.LoadXML sXSLT
    Set xe = x2.parseError
    If xe.ErrorCode <> 0 Then
        MsgBox "Err: " & xe.reason
        Exit Sub
    End If

    ' Compile the stylesheet and run the transformation
    Dim xt As XSLTemplate
    Set xt = New XSLTemplate
    Set xt.stylesheet = x2
    Dim xp As IXSLProcessor
    Set xp = xt.createProcessor
    xp.input = x
    xp.transform

    ' Write the cleaned markup next to the original
    Sheet1.Range("B1").Value = xp.output
End Sub
Here is the result (still ugly, but without attributes):
<?xml version="1.0" encoding="UTF-16"?><html xmlns="http://www.w3.org/1999/xhtml"><head><meta></meta><title></title></head><body><div><table><tbody><tr><td><p><font><b>Mont
Blanc Scott Roof mounted cycle bike carrier<br></br><br></br>
Part Number: 728540</b></font></p></td><td><a><img></img></a></td></tr><tr><td><b><font><script>
//
<!--function click() { if (event.button==2) { alert('All graphics, descriptions and other information, including the HTML code of this listing are the property of XXXX Limited and may not be reproduced in any form without the express permission of XXXX Limited. Email us: sales@XXXX.com'); } } document.onmousedown=click // -->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!---->
<!----> -->
//</script></font></b><div><center><table><tbody><tr><td><p><img></img></p></td><td><p><font><u><strong>Mont Blanc</strong></u></font><u><strong><font>Scott Roof
Bar Rack 1 Cycle Carrier</font></strong></u></p></td><td><img></img></td></tr><tr><td><hr></hr><p><img></img></p><p><font><b>Scott</b></font></p><ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive
oval carrying bar.<br></br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></br></li><li>Extra wide wheel holders take the fattest tyres<br></br></li><li>Strong Webbing straps fasten wheels securely to
carrier<br></br></li><li><font>Upright, roof bar mounted, locking
cycle carrier<br></br></font></li><li><font> Locks to roof rails and
locks bikes<br></br></font></li><li><font> Quick and easy to
use<br></br></font></li><li><font>Adjustable for most cycle
styles</font></li></ul><center><table><tbody><tr><td><p><a><img></img></a></p></td><td>To view Fitting Instructions in
PDF format please click the spanner</td></tr></tbody></table><table><tbody><tr><td><font>Technical data</font></td><td><p><font>Mont</font> Blanc Scott</p><p><img></img></p></td></tr><tr><td><div>Max number of bikes</div></td><td><div>1</div></td></tr><tr><td><div>Load capacity (kg)</div></td><td><div>15 KG</div></td></tr><tr><td><div>Weight (kg)</div></td><td><div>2.2KG</div></td></tr><tr><td><div>Fits frame-dimensions (mm)</div></td><td>Up to 80mm</td></tr><tr><td><div>Fits wheel-dimensions</div></td><td><div>All</div></td></tr><tr><td><div>Locks bikes to carrier</div></td><td><div>Yes</div></td></tr><tr><td><div>Locks carrier to car</div></td><td><div>Yes</div></td></tr><tr><td><div>Tilt function, with bikes</div></td><td><div>NA</div></td></tr><tr><td><div>TÜV/EuroBE approved</div></td><td><div>NA</div></td></tr><tr><td><div>Fullfills City Crash norms</div></td><td><div>NA</div></td></tr><tr><td><div>Miscellaneous</div></td><td><div><p>Fits all types of Roof Bars,</p></div></td></tr></tbody></table><p><f
ont>The cycle carrier
is guaranteed for Five year from date of purchase.<br></br><br></br>
We stock a wide range of towbars and towing accessories.
<a><br></br>
Click here to email us</a> if you require details of our other
towing equipment.</font></p><hr></hr></center></td></tr></tbody></table></center></div><b><br></br>
Please note that with the Type of cycle carrier where you mount
it<br></br>
onto a flange ball you may need the long reach ball which
will<br></br>
allow you enough clearance from the bumper</b></td></tr><tr><td><a><img></img></a><b><font>Not from the UK ? Click
the flag to purchase this item from our EU site</font></b><a><img></img></a></td></tr></tbody></table></div></body></html>
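Edit: This XSLT seems to do the trick; it removes some tags together with their content and others while keeping their content, whichever you specify. Perhaps someone with some XSLT knowledge can elaborate.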
<?xml version='1.0' encoding='ISO-8859-1'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform' xmlns:xhtml="http://www.w3.org/1999/xhtml" >
<xsl:template match='node()|@*'>
<xsl:copy>
<xsl:apply-templates select='node()'/>
</xsl:copy>
</xsl:template>
<!--these tags will be removed with their content-->
<xsl:template match='xhtml:script|xhtml:head'/>
<!--these tags will be removed but keep their content-->
<xsl:template match='xhtml:font|xhtml:p|xhtml:b|xhtml:u|xhtml:i|xhtml:center|xhtml:a|xhtml:img|xhtml:strong'><xsl:apply-templates/></xsl:template>
</xsl:stylesheet>
Result:
<?xml version="1.0" encoding="UTF-16"?><html xmlns="http://www.w3.org/1999/xhtml"><body><div><table><tbody><tr><td>Mont
Blanc Scott Roof mounted cycle bike carrier<br></br><br></br>
Part Number: 728540</td><td></td></tr><tr><td><div><table><tbody><tr><td></td><td>Mont BlancScott Roof
Bar Rack 1 Cycle Carrier</td><td></td></tr><tr><td><hr></hr>Scott<ul><li>Stylish, easy to use roof mounted cycle carrier, distinctive
oval carrying bar.<br></br></li><li>Extra Soft Frame clamps hold cycle safely and gently<br></br></li><li>Extra wide wheel holders take the fattest tyres<br></br></li><li>Strong Webbing straps fasten wheels securely to
carrier<br></br></li><li>Upright, roof bar mounted, locking
cycle carrier<br></br></li><li> Locks to roof rails and
locks bikes<br></br></li><li> Quick and easy to
use<br></br></li><li>Adjustable for most cycle
styles</li></ul><table><tbody><tr><td></td><td>To view Fitting Instructions in
PDF format please click the spanner</td></tr></tbody></table><table><tbody><tr><td>Technical data</td><td>Mont Blanc Scott</td></tr><tr><td><div>Max number of bikes</div></td><td><div>1</div></td></tr><tr><td><div>Load capacity (kg)</div></td><td><div>15 KG</div></td></tr><tr><td><div>Weight (kg)</div></td><td><div>2.2KG</div></td></tr><tr><td><div>Fits frame-dimensions (mm)</div></td><td>Up to 80mm</td></tr><tr><td><div>Fits wheel-dimensions</div></td><td><div>All</div></td></tr><tr><td><div>Locks bikes to carrier</div></td><td><div>Yes</div></td></tr><tr><td><div>Locks carrier to car</div></td><td><div>Yes</div></td></tr><tr><td><div>Tilt function, with bikes</div></td><td><div>NA</div></td></tr><tr><td><div>TÜV/EuroBE approved</div></td><td><div>NA</div></td></tr><tr><td><div>Fullfills City Crash norms</div></td><td><div>NA</div></td></tr><tr><td><div>Miscellaneous</div></td><td><div>Fits all types of Roof Bars,</div></td></tr></tbody></table>The cycle carrier
is guaranteed for Five year from date of purchase.<br></br><br></br>
We stock a wide range of towbars and towing accessories.
<br></br>
Click here to email us if you require details of our other
towing equipment.<hr></hr></td></tr></tbody></table></div><br></br>
Please note that with the Type of cycle carrier where you mount
it<br></br>
onto a flange ball you may need the long reach ball which
will<br></br>
allow you enough clearance from the bumper</td></tr><tr><td>Not from the UK ? Click
the flag to purchase this item from our EU site</td></tr></tbody></table></div></body></html>
Answer 3 (score: -1)
This regex should give you the expected result, but I haven't tested it:
$yourhtml = preg_replace('/(<[^>]*?)\s+style\s*=\s*"[^"]*"([^>]*>)/i', '$1$2', $yourhtml);
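Answer 4 (score: -1)
I think the regex required may be a lot simpler than you think, but then again, I don't know what the product descriptions look like. What are the chances of encountering < and > in the descriptions other than as part of an HTML tag? If the chances are very small, wouldn't something like this do the trick?
$new_description = preg_replace('/<[^>]+>/', '', $description);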