Need to transform Microsoft Office XML using HXT

Asked: 2013-10-24 02:57:45

Tags: haskell hxt

I have HTML generated by Microsoft Office that looks like this:

      <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><span style="font-family:Symbol">
          <span style="mso-list:Ignore">·<span style="font:7.0pt &quot;Times New Roman&quot;">        
</span></span>
        </span>It’s a media conglomerate, need to understand the parts<o:p/></p>
      <p class="MsoListParagraph" style="text-indent:-.25in;mso-list:l0 level1 lfo1"><![if !supportLists]><span style="font-family:Symbol">
          <span style="mso-list:Ignore">·<span style="font:7.0pt &quot;Times New Roman&quot;">        
</span></span>
        </span><![endif]>Largest TV broadcaster in Mexico<o:p/></p>
      <p class="MsoListParagraph" style="margin-left:1.0in;text-indent:-.25in;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-family:&quot;Courier New&quot;">
          <span style="mso-list:Ignore">o<span style="font:7.0pt &quot;Times New Roman&quot;">  
</span></span>
        </span><![endif]>There’s 7 free air channels in Mexico and they have 4<o:p/></p>
      <p class="MsoListParagraph" style="margin-left:1.0in;text-indent:-.25in;mso-list:l0 level2 lfo1">
<![if !supportLists]><span style="font-family:&quot;Courier New&quot;">
          <span style="mso-list:Ignore">o<span style="font:7.0pt &quot;Times New Roman&quot;">  
</span></span>
        </span><![endif]>70% of citizens watch their channels<o:p/></p>

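I want to use HXT to transform the DOM structure so that:

  1. Every <p> whose style contains "mso-list:l0 level1" becomes <ul><li class="level1">, and every <p> whose style contains "mso-list:l0 level2" becomes <ul><li class="level2">.

  2. Consecutive level2 items are nested inside the level1 item that immediately precedes them.

I have tried various experiments with HXT, using the Control.Arrow.ArrowNavigatableTree functions and getXPathTrees from Text.XML.HXT.XPath.Arrows, but after 4 hours I have had no luck.

Any suggestions? I suspect the solution involves folding over the list of sibling XmlTrees.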
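To make the "fold over the siblings" idea concrete, here is a minimal sketch that deliberately leaves HXT out and uses only base. The types `Item` and `Entry` and the functions `styleLevel`, `groupItems` and `render` are hypothetical names invented for this illustration; `Item` stands in for a `<p>` node whose level has already been read from its style attribute. It is one possible shape for the fold, not a tested HXT solution.

```haskell
import Data.Char (isDigit)
import Data.List (isPrefixOf, tails)
import Data.Maybe (listToMaybe)

-- | Hypothetical stand-in for one <p class="MsoListParagraph"> node:
-- its nesting level and its visible text.
data Item = Item { itemLevel :: Int, itemText :: String }
  deriving (Eq, Show)

-- | A level-1 entry together with the level-2 texts nested under it.
data Entry = Entry { entryText :: String, entryChildren :: [String] }
  deriving (Eq, Show)

-- | Read N from the first "levelN" found in a style value.
-- Returns Nothing when the style has no such token.
styleLevel :: String -> Maybe Int
styleLevel style = listToMaybe
  [ read digits
  | suffix <- tails style
  , "level" `isPrefixOf` suffix
  , let digits = takeWhile isDigit (drop 5 suffix)
  , not (null digits)
  ]

-- | Right fold over the siblings. Level-2 (or deeper) items are collected
-- in 'pending' until the level-1 item that precedes them is reached.
-- Assumption: level-2 items with no preceding level-1 item are dropped.
groupItems :: [Item] -> [Entry]
groupItems = snd . foldr step ([], [])
  where
    step (Item lvl txt) (pending, entries)
      | lvl >= 2  = (txt : pending, entries)
      | otherwise = ([], Entry txt pending : entries)

-- | Render the nested structure as markup. Text is not escaped here.
render :: [Entry] -> String
render entries = "<ul>" ++ concatMap entry entries ++ "</ul>"
  where
    entry (Entry txt kids) =
      "<li class=\"level1\">" ++ txt ++ nested kids ++ "</li>"
    nested []   = ""
    nested kids = "<ul>" ++ concatMap kid kids ++ "</ul>"
    kid txt     = "<li class=\"level2\">" ++ txt ++ "</li>"
```

If the same grouping were written as a pure function of type `[XmlTree] -> [XmlTree]`, it could presumably be applied to the children of the enclosing element with HXT's `>>.` combinator from Control.Arrow.ArrowList, which runs a list function over an arrow's results. That wiring is an assumption and is not shown here.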

Edit

Here is the solution I have come up with so far:

    GitHub Gist version


0 answers:

No answers yet