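PHP preg_replace: strip only the MS Word image markup

Time: 2016-11-01 20:52:44

Tags: php html ms-word

I have a client who writes his newsletter in Word and then copies the HTML into MailChimp to send it out.
Word has all sorts of weird and wonderful formatting ideas, most of which need to be kept so that the result stays consistent with what he uses and what he sees.

The only real problem is how MS Word inserts images. Here is a snippet in which it adds both an <img> tag and <v:shape><v:imagedata> tags: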

<td width=640 style='width:480.0pt;border-top:solid #1F497D 1.0pt;mso-border-top-themecolor: text2;border-left:none;border-bottom:solid #1F497D 1.0pt;mso-border-bottom-themecolor: text2;border-right:none;background:#1F497D;mso-background-themecolor:text2; padding:0cm 0cm 0cm 0cm;height:26.6pt'>
<p class=MsoNormal align=center style='text-align:center'><b style='mso-bidi-font-weight:normal'><span style='font-family:"Arial","sans-serif"; mso-ansi-language:EN-NZ;mso-fareast-language:EN-NZ;mso-no-proof:yes'><!--[if gte vml 1]><v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum @0 1 0"/>
<v:f eqn="sum 0 0 @1"/>
<v:f eqn="prod @2 1 2"/>
<v:f eqn="prod @3 21600 pixelWidth"/>
<v:f eqn="prod @3 21600 pixelHeight"/>
<v:f eqn="sum @0 0 1"/>
<v:f eqn="prod @6 1 2"/>
<v:f eqn="prod @7 21600 pixelWidth"/>
<v:f eqn="sum @8 21600 0"/>
<v:f eqn="prod @7 21600 pixelHeight"/>
<v:f eqn="sum @10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype><v:shape id="_x0000_i1033" type="#_x0000_t75" style='width:479.25pt;height:112.5pt;visibility:visible;mso-wrap-style:square'>
<v:imagedata src="22nd%20September%20-%20Take%205...%20Your%205%20minute%20fortnightly%20roundup%20of%20alcohol%20and%20other%20drug%20news%20and%20research%202_files/image001.png" o:title=""/>
</v:shape><![endif]--><![if !vml]><img border=0 width=639 height=150 src="22nd%20September%20-%20Take%205...%20Your%205%20minute%20fortnightly%20roundup%20of%20alcohol%20and%20other%20drug%20news%20and%20research%202_files/image025.png"v:shapes="_x0000_i1033"><![endif]></span></b><b style='mso-bidi-font-weight:normal'><span lang=EN-GB style='font-family:"Arial","sans-serif"'><o:p></o:p></span></b></p>
</td>

If I strip out all of the MS conditional code, it kills all of the formatting:

$parsed_html = preg_replace('/<!--\[[\s\S]*?\]-->/s', '', $html);

I tried to be more specific:

$parsed_html = preg_replace('/<!--\[if gte vml 1\]*?--><!\[if !vml\]>/s', '', $html);

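This runs fine, but again it strips too much. Do you know of a way to get Word to output better HTML (haha), or of a better match pattern?

Here is a complete HTML document: http://pastebin.com/myPwnHbd

This is the PHP so far (it takes the HTML file uploaded from a simple HTML form): http://pastebin.com/Wc7hEk7c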

1 answer:

Answer 0: (score: 0)

Thanks, that post pointed to: http://htmlpurifier.org/

My final code (abridged) is:

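<?php

    // Suppress error output so PHP notices do not leak into the cleaned HTML.
    error_reporting(0); ini_set('display_errors', FALSE);

    // Load HTML Purifier 4.8.0 from the htmlpurifier-4.8.0 directory.
    require_once 'htmlpurifier-4.8.0/library/HTMLPurifier.auto.php';

    // Read the Word-generated HTML file uploaded through the form.
    $html = file_get_contents($_FILES['file']['tmp_name']);

    $config = HTMLPurifier_Config::createDefault();

    // Word's export here is not UTF-8, so declare the encoding explicitly.
    $config->set('Core.Encoding', 'ISO-8859-1');

    // Turn text separated by double newlines into paragraphs where possible.
    $config->set('AutoFormat.AutoParagraph', true);

    $purifier = new HTMLPurifier($config);

    $clean_html = $purifier->purify( $html );

    echo $clean_html;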
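
For the narrower "better match pattern" the question asks about, a regex-only route might look like the sketch below. It is not part of the original answer. It assumes Word always emits the pair seen in the question's snippet: a `<!--[if gte vml 1]> ... <![endif]-->` block holding the VML, followed by a `<![if !vml]> ... <![endif]>` wrapper around the plain `<img>` fallback. The function name `strip_word_vml` is hypothetical.

```php
<?php
// Hypothetical helper, not from the original post: removes Word's VML-only
// image markup and keeps the plain <img> fallback. All other conditional
// comments (and therefore the rest of Word's formatting) are left untouched.
function strip_word_vml(string $html): string
{
    // 1. Remove <!--[if gte vml 1]> ... <![endif]--> together with its contents.
    //    The /s flag lets "." span the many lines of <v:shapetype> markup.
    $html = preg_replace('/<!--\[if gte vml 1\]>.*?<!\[endif\]-->/s', '', $html) ?? $html;

    // 2. Unwrap <![if !vml]> ... <![endif]>, keeping what is inside (the <img>).
    $html = preg_replace('/<!\[if !vml\]>(.*?)<!\[endif\]>/s', '$1', $html) ?? $html;

    // 3. Drop the v:shapes attribute, which only referred to the removed VML shape.
    return preg_replace('/\s*v:shapes="[^"]*"/', '', $html) ?? $html;
}
```

Calling `strip_word_vml($html)` in place of the two `preg_replace` attempts from the question would leave blocks such as `<!--[if gte mso 9]>` alone, because only the `vml` conditions are matched. Regular expressions remain fragile on HTML, so a change in Word's output could break these patterns.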