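I have a client who writes his newsletter in Word, then copies the HTML into MailChimp and sends it out.
Word has all sorts of weird and wonderful formatting ideas, most of which need to be retained so that the formatting stays consistent with what he is used to and what he sees.
The only real problem is how MS Word inserts images. Here is a snippet; it adds both an <img> tag and <v:shape><v:imagedata> tags: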
<td width=640 style='width:480.0pt;border-top:solid #1F497D 1.0pt;mso-border-top-themecolor: text2;border-left:none;border-bottom:solid #1F497D 1.0pt;mso-border-bottom-themecolor: text2;border-right:none;background:#1F497D;mso-background-themecolor:text2; padding:0cm 0cm 0cm 0cm;height:26.6pt'>
<p class=MsoNormal align=center style='text-align:center'><b style='mso-bidi-font-weight:normal'><span style='font-family:"Arial","sans-serif"; mso-ansi-language:EN-NZ;mso-fareast-language:EN-NZ;mso-no-proof:yes'><!--[if gte vml 1]><v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum @0 1 0"/>
<v:f eqn="sum 0 0 @1"/>
<v:f eqn="prod @2 1 2"/>
<v:f eqn="prod @3 21600 pixelWidth"/>
<v:f eqn="prod @3 21600 pixelHeight"/>
<v:f eqn="sum @0 0 1"/>
<v:f eqn="prod @6 1 2"/>
<v:f eqn="prod @7 21600 pixelWidth"/>
<v:f eqn="sum @8 21600 0"/>
<v:f eqn="prod @7 21600 pixelHeight"/>
<v:f eqn="sum @10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype><v:shape id="_x0000_i1033" type="#_x0000_t75" style='width:479.25pt;height:112.5pt;visibility:visible;mso-wrap-style:square'>
<v:imagedata src="22nd%20September%20-%20Take%205...%20Your%205%20minute%20fortnightly%20roundup%20of%20alcohol%20and%20other%20drug%20news%20and%20research%202_files/image001.png" o:title=""/>
</v:shape><![endif]--><![if !vml]><img border=0 width=639 height=150 src="22nd%20September%20-%20Take%205...%20Your%205%20minute%20fortnightly%20roundup%20of%20alcohol%20and%20other%20drug%20news%20and%20research%202_files/image025.png"v:shapes="_x0000_i1033"><![endif]></span></b><b style='mso-bidi-font-weight:normal'><span lang=EN-GB style='font-family:"Arial","sans-serif"'><o:p></o:p></span></b></p>
</td>
If I remove all of the MS conditional code, it kills all of the formatting:
$parsed_html = preg_replace('/<!--\[[\s\S]*?\]-->/s', '', $html);
I tried to be more specific:
$parsed_html = preg_replace('/<!--\[if gte vml 1\]*?--><!\[if !vml\]>/s', '', $html);
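This works OK, but again it strips too much. Do you know of a way to either output better HTML (haha) or write a better match pattern?
Here is a complete HTML document: http://pastebin.com/myPwnHbd
Here is the PHP so far (it takes the HTML file uploaded from a simple HTML form): http://pastebin.com/Wc7hEk7c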
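One detail about the second pattern: `\]*?` only matches a run of `]` characters, so it cannot span the VML markup between `<!--[if gte vml 1]>` and `<![if !vml]>`. A narrower approach can be sketched as follows. This is a minimal sketch and not the code from the pastebin: it assumes PHP 7+, assumes Word always emits the markers exactly as in the snippet above, and `strip_word_vml` is a hypothetical helper name. It drops the VML-only branch, unwraps the `<![if !vml]>` fallback so the plain `<img>` survives, and leaves every other conditional comment alone.

```php
<?php
// Hypothetical helper (not from the original post): removes only Word's
// VML image markup and keeps all other conditional comments untouched.
function strip_word_vml(string $html): string
{
    $steps = [
        // 1. Drop the VML-only branch: <!--[if gte vml 1]> ... <![endif]-->
        '/<!--\[if gte vml 1\]>.*?<!\[endif\]-->/s' => '',
        // 2. Unwrap the fallback branch <![if !vml]> ... <![endif]>,
        //    keeping whatever sits between the markers (the <img> tag).
        '/<!\[if !vml\]>(.*?)<!\[endif\]>/s' => '$1',
        // 3. Remove the v:shapes="..." attribute Word leaves on the <img>.
        '/\s*v:shapes="[^"]*"/' => '',
    ];

    foreach ($steps as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $html);
        if ($result === null) {
            // preg_replace() returns null on a PCRE error (e.g. backtrack limit).
            throw new RuntimeException('PCRE error code: ' . preg_last_error());
        }
        $html = $result;
    }

    return $html;
}
```

It would be called in place of the patterns above, for example `$parsed_html = strip_word_vml($html);`. Regex-based cleanup like this is fragile against variations in Word's output, so it is worth checking the result on a full exported document.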
Answer 0 (score: 0)
Thanks, that post pointed to: http://htmlpurifier.org/
My final code (abridged) is:
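<?php
// Suppress error output so warnings do not leak into the generated HTML.
error_reporting(0);
ini_set('display_errors', '0');
require_once 'htmlpurifier-4.8.0/library/HTMLPurifier.auto.php';

// Read the Word-exported HTML file uploaded through the form.
if (!isset($_FILES['file']) || !is_uploaded_file($_FILES['file']['tmp_name'])) {
    exit('No file uploaded.');
}
$html = file_get_contents($_FILES['file']['tmp_name']);

$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'ISO-8859-1');     // character encoding of the input and output
$config->set('AutoFormat.AutoParagraph', true);  // turn double newlines into paragraphs
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($html);
echo $clean_html;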