How do I correctly handle UTF-8 characters in Haskell?

Time: 2013-06-19 19:44:01

Tags: haskell utf-8 rss

I wrote a small RSS feed downloader in Haskell, and I have run into a problem with this story. The RSS item is:

<item>
    <title>Defense lawyer says gov’t hid NSA role in California terrorism case</title>
    <link>http://feeds.arstechnica.com/~r/arstechnica/index/~3/hh41K3S-dug/</link>
    <comments>http://arstechnica.com/tech-policy/2013/06/defense-lawyer-says-govt-hid-nsa-role-in-california-terrorism-case/#comments</comments>
    <pubDate>Wed, 19 Jun 2013 17:04:16 +0000</pubDate>
    <dc:creator>Cyrus Farivar</dc:creator>
    <category><![CDATA[Law & Disorder]]></category>
    <category><![CDATA[FISA]]></category>
    <category><![CDATA[FISC]]></category>
    <category><![CDATA[NSA]]></category>
    <category><![CDATA[san diego]]></category>
    <category><![CDATA[Terrorism]]></category>

    <guid isPermaLink="false">http://arstechnica.com/?p=292287</guid>
    <description><![CDATA["We're going to evaluate our options as to what to do now," attorney says.]]></description>
    <content:encoded><![CDATA[<div id="rss-wrap"> <p>Now that the <a href="http://arstechnica.com/tech-policy/2013/06/nsa-head-says-digital-spying-has-disrupted-a-little-over-10-plots-domestically/">National Security Agency (NSA) and other law enforcement institutions have begun to pull </a><a style="font-size: 14px; line-height: 19px;" href="http://arstechnica.com/tech-policy/2013/06/nsa-head-says-digital-spying-has-disrupted-a-little-over-10-plots-domestically/">back </a><a style="font-size: 14px; line-height: 19px;" href="http://arstechnica.com/tech-policy/2013/06/nsa-head-says-digital-spying-has-disrupted-a-little-over-10-plots-domestically/">the veil on surveillance tactics</a><span style="font-size: 14px; line-height: 19px;"> and their newly disclosed relationship in suspected terrorism cases, at least one defense attorney is starting to challenge previously closed cases.</span></p>
            <p>Among the cases officials cited where NSA surveillance proved useful in securing a conviction was that of <a href="https://www.fbi.gov/sandiego/press-releases/2013/san-diego-jury-convicts-four-somali-immigrants-of-providing-support-to-foreign-terrorists">Basaaly Saeed Moalin</a>, a San Diego cab driver. Moalin was convicted in February 2013 on five counts, including conspiracy to provide material support to a foreign terrorist organization, Somali terrorist group Al Shabaab.</p>
            <p>"We're going to evaluate our options as to what to do now to get to the bottom of this," Joshua Dratel, a New York-based defense attorney representing Moalin, told <em><a href="http://www.wired.com/threatlevel/2013/06/nsa-defense-lawyers/">Wired</a></em> on Tuesday. "We can't learn about it until it's to the government's tactical advantage politically to disclose it. National security is about keeping illegal conduct concealed from the American public until you're forced to justify it because someone ratted you out."</p>
            </div><p><a href="http://arstechnica.com/tech-policy/2013/06/defense-lawyer-says-govt-hid-nsa-role-in-california-terrorism-case/#p3">Read 5 remaining paragraphs</a> | <a href="http://arstechnica.com/tech-policy/2013/06/defense-lawyer-says-govt-hid-nsa-role-in-california-terrorism-case/?comments=1">Comments</a></p><div class="feedflare">
            <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=hh41K3S-dug:QhYtCojMxzM:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?i=hh41K3S-dug:QhYtCojMxzM:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=hh41K3S-dug:QhYtCojMxzM:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?i=hh41K3S-dug:QhYtCojMxzM:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=hh41K3S-dug:QhYtCojMxzM:qj6IDK7rITs"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?d=qj6IDK7rITs" border="0"></img></a> <a href="http://feeds.arstechnica.com/~ff/arstechnica/index?a=hh41K3S-dug:QhYtCojMxzM:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/arstechnica/index?d=yIl2AUoC8zA" border="0"></img></a>
    </div><img src="http://feeds.feedburner.com/~r/arstechnica/index/~4/hh41K3S-dug" height="1" width="1"/>]]></content:encoded>
    <wfw:commentRss>http://arstechnica.com/tech-policy/2013/06/defense-lawyer-says-govt-hid-nsa-role-in-california-terrorism-case/feed/</wfw:commentRss>
    <slash:comments>0</slash:comments>
    <feedburner:origLink>http://arstechnica.com/tech-policy/2013/06/defense-lawyer-says-govt-hid-nsa-role-in-california-terrorism-case/</feedburner:origLink>
</item>

Haskell doesn't seem to like the apostrophe used in the title.

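  1. My first attempt hit an invalid character error.
  2. After explicitly setting UTF-8 on stdout, the output got a little better.
  3. If I save the item to a local file (copy-pasted into Vim), I get a slightly different result.
  4. However, none of these attempts results in the apostrophe being interpreted and printed correctly. I should note that I am using Text.XML.Light for parsing, and the result looks the same if I write to a file instead of printing to the console.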

Any idea why this isn't working? For reference, my code is here.

1 answer:

Answer 0 (score: 2)

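The HTTP package doesn't seem to decode the bytes correctly. Here is a version that decodes them manually with decodeUtf8 from the text package (it lives in Data.Text.Encoding), and it works. I'm not sure whether there is a better way.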
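To illustrate the idea, here is a minimal sketch rather than the code linked above. It assumes the response body is available as a strict ByteString, and the names rawTitle, byteAsChar, decodeTitle and printTitle are made up for this example. The apostrophe in the title is U+2019, which UTF-8 encodes as the three bytes E2 80 99. If each byte is turned into its own Char, those three bytes become three junk characters; decoding them as UTF-8 gives back a single character.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical sample: the raw UTF-8 bytes of "gov’t" as they appear on the wire.
-- The apostrophe U+2019 is encoded as the three bytes E2 80 99.
rawTitle :: B.ByteString
rawTitle = B.pack [0x67, 0x6F, 0x76, 0xE2, 0x80, 0x99, 0x74]

-- The failure mode (assumed): every byte becomes one Char,
-- so the apostrophe turns into three separate characters.
byteAsChar :: B.ByteString -> String
byteAsChar = BC.unpack

-- The fix: decode the bytes as UTF-8 into Text.
-- Note that decodeUtf8 throws an exception on invalid input.
decodeTitle :: B.ByteString -> T.Text
decodeTitle = TE.decodeUtf8

-- Print the decoded title, forcing stdout to UTF-8 regardless of the locale.
printTitle :: IO ()
printTitle = do
  hSetEncoding stdout utf8
  TIO.putStrLn (decodeTitle rawTitle)
```

Calling printTitle from main or GHCi prints the decoded title. If a String is needed afterwards, for example to hand to an XML parser, T.unpack converts the decoded Text. For feeds that may contain malformed bytes, decodeUtf8' from the same module returns an Either instead of throwing.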