xml escape special characters

Time: 2015-09-30 23:20:51

Tags: xml escaping

Create a file with this content:

<xml>yen symbol - ¥</xml>

Open the file in Firefox and you get this error:

XML Parsing Error: not well-formed
Location: file:///test.xml
Line Number 1, Column 19:<xml>yen symbol - </xml>
------------------^

How can I escape special characters in XML?

NOTE: I'm using the .NET XmlDocument.OuterXml property to retrieve the XML. For some reason, .NET doesn't escape the yen character automatically.

Update: The real problem is that I construct the XML in .NET code and push it over HTTP to Solr. The Java code inside Solr breaks because it considers the yen character to be malformed XML. I set the encoding to UTF-8.

' Requires: Imports System.IO and Imports System.Net at the top of the file.
Public Shared Sub UpdateRecords(p_SolrRecordCollection As SolrRecordCollection, Optional commit As Boolean = True, Optional optimize As Boolean = True)
    Try
        Dim webClientInstance As New WebClient()
        webClientInstance.Headers.Add("Content-Type", "text/xml")
        ' UploadString converts the string to bytes using this encoding.
        webClientInstance.Encoding = System.Text.Encoding.UTF8
        Dim xml = p_SolrRecordCollection.XmlDocument.OuterXml
        Dim params As String = String.Format("?commit={0}&optimize={1}", commit.ToString.ToLower, optimize.ToString.ToLower)
        webClientInstance.UploadString(SolrURL + UpdateRelativeURL + params, xml)
    Catch ex As WebException
        Dim responseText As String = String.Empty
        If ex.Response IsNot Nothing Then
            Using reader = New StreamReader(ex.Response.GetResponseStream())
                ' Keep the " :" prefix instead of overwriting it with the response body.
                responseText = " :" & ControlChars.NewLine & reader.ReadToEnd()
            End Using
        End If
        Throw New Exception("Request to Solr failed" & responseText, ex)
    End Try
End Sub

This is the error reported by Solr:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">135</int></lst><lst name="error"><str name="msg">[com.ctc.wstx.exc.WstxLazyException] Illegal character entity: expansion character (code 0xb) not a valid XML character
 at [row,col {unknown-source}]: [827,871]</str><str name="trace">[com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xb) not a valid XML character
 at [row,col {unknown-source}]: [827,871]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:639)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0xb) not a valid XML character
 at [row,col {unknown-source}]: [827,871]
    at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
    at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
    at com.ctc.wstx.sr.StreamScanner.reportIllegalChar(StreamScanner.java:2400)
    at com.ctc.wstx.sr.StreamScanner.checkAndExpandChar(StreamScanner.java:2346)
    at com.ctc.wstx.sr.StreamScanner.resolveSimpleEntity(StreamScanner.java:1205)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4677)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 36 more
</str><int name="code">500</int></lst>
</response>

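3 answers:

Answer 0 (score: 3)

The file you are creating is not saved as UTF-8; it is probably saved as ASCII or in an ANSI code page. You can check this by opening it in Notepad or any other text editor that can save files in UTF-8. In Notepad's "Save As..." dialog there is an Encoding drop-down box, and by default it shows the encoding the file is already in.

You don't need to escape the yen character at all. Once the file is converted to UTF-8, Firefox or any other XML parser should have no problem with it.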
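To illustrate the point about the file's encoding, here is a minimal sketch in Python (chosen only as a neutral illustration language; the question's code is VB.NET). It assumes the file was saved in a single-byte code page such as Windows-1252, which is a guess and not something the question confirms.

```python
import xml.etree.ElementTree as ET

text = "<xml>yen symbol - \u00a5</xml>"

# The same document saved in two encodings. Windows-1252 is an assumed
# example of a legacy single-byte "ANSI" code page.
as_utf8 = text.encode("utf-8")      # yen becomes the two bytes C2 A5
as_cp1252 = text.encode("cp1252")   # yen becomes the single byte A5


def parses(data: bytes) -> bool:
    """Return True if the bytes are well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Without an XML declaration the parser assumes UTF-8, and a lone A5 byte
# is not a valid UTF-8 sequence.
print(parses(as_utf8))     # True
print(parses(as_cp1252))   # False
```

The failing byte sits right after the 18 characters of "<xml>yen symbol - ", which is consistent with the column 19 that Firefox reports.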

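Your error message leads me to believe that the yen character is a red herring:

"expansion character (code 0xb) not a valid XML character"

Code 0xB is the vertical tab character (U+000B), which XML 1.0 does not allow in a document, not even as a character reference. The "character entity" wording suggests the payload contains a numeric reference such as &#xB; near row 827, column 871, so it is worth checking the source data for a stray vertical tab. It also sounds like there may be some corruption in an encoding conversion. I'm not sure what encoding your SolrRecordCollection object returns, but I would guess it is UTF-8. If you can, find out what encoding the XmlDocument method returns.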
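One way to test the vertical-tab theory is to strip characters that XML 1.0 forbids before building the document. The sketch below is in Python for illustration only; strip_invalid_xml_chars and the sample field value are hypothetical, and the character ranges follow the Char production of the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside these ranges is illegal in XML 1.0, even when written
# as a numeric character reference such as &#xB;.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000a\u000d\u0020-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def strip_invalid_xml_chars(value: str) -> str:
    """Remove characters that are not legal in XML 1.0 (hypothetical helper)."""
    return INVALID_XML_CHARS.sub("", value)


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


field_value = "price\x0b100 \u00a5"   # made-up value containing a vertical tab

bad = "<field>price&#xB;100 \u00a5</field>"
good = "<field>" + strip_invalid_xml_chars(field_value) + "</field>"

print(is_well_formed(bad))    # False
print(is_well_formed(good))   # True
```

Note that the yen sign survives the filter untouched; only the control character is removed. Replacing it with a space instead of deleting it would be an equally reasonable choice.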

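The WebClient.UploadString method performs an encoding conversion:

"Before uploading the string, this method converts it to a Byte array using the encoding specified in the Encoding property."

So my guess is that it is taking the UTF-8 string, interpreting it as a standard .NET UTF-16 string, and then converting that misinterpreted data to UTF-8. I think that if you convert your XML string variable to UTF-16 and then pass it to the method, it might solve your problem. Here is a question that explains how to do that: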

How do you convert an xml string with UTF-8 encoding UTF-16?
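To rule out a double conversion, it can help to make the string-to-bytes step explicit, do it exactly once, and name the charset in the Content-Type header. The following is only a sketch in Python of that idea, not the WebClient API: the URL is an assumed local Solr address, build_update_request is a hypothetical helper, and the request is built but never sent.

```python
import urllib.request

# Assumed address of a local Solr update handler.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_update_request(xml_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST whose body is explicit UTF-8 bytes."""
    body = xml_text.encode("utf-8")   # the only str -> bytes conversion
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


sample_xml = '<add><doc><field name="name">yen symbol - \u00a5</field></doc></add>'
request = build_update_request(sample_xml)

print(request.get_method())                 # POST
print(request.get_header("Content-type"))   # text/xml; charset=utf-8
print(b"\xc2\xa5" in request.data)          # True
```

In .NET the analogous move would be to encode the string yourself and upload the resulting bytes, so that no second conversion can happen behind your back.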

FYI, this is an easy-to-read article that helps with understanding text encoding:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

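Answer 1 (score: 1)

Make sure the file is saved in an encoding that handles the yen character correctly and that Firefox recognizes, such as UTF-8. (It seems to me that Firefox expects Unicode when nothing else is specified, but I have not verified this.) Then there is no need to escape the character.

Better yet, add an XML declaration that states the encoding used: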

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xml>yen symbol - ¥</xml>
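The declaration only helps when it matches the bytes actually written to disk. As a rough Python illustration (ISO-8859-1 is an assumed example of a non-UTF-8 encoding, and try_parse is a hypothetical helper), the same bytes fail without a declaration and parse once the encoding is declared:

```python
import xml.etree.ElementTree as ET

body = "<xml>yen symbol - \u00a5</xml>"

# The same content encoded as ISO-8859-1, with and without a declaration
# that names that encoding.
undeclared = body.encode("iso-8859-1")
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>' + body
).encode("iso-8859-1")


def try_parse(data: bytes):
    """Return the root element's text, or None if the bytes do not parse."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


print(try_parse(undeclared) is None)                  # True
print(try_parse(declared) == "yen symbol - \u00a5")   # True
```

Declaring UTF-8 while saving the file in another encoding would fail in the same way as the undeclared case.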

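Answer 2 (score: 0)

I went this route: I rewrote my upload logic to use JSON, and I use Newtonsoft's Json library to handle all the JSON escaping. I know this is not the proper way to solve the problem, but it is a working solution to all the XML nightmares I have been through.

Reference: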

https://wiki.apache.org/solr/UpdateJSON
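For comparison, this is roughly what the JSON route looks like, sketched in Python rather than with Newtonsoft's library. The documents and field names are made up and do not reflect any real Solr schema; the point is only that a JSON serializer escapes control characters that XML rejects.

```python
import json

# Made-up documents; the field names are placeholders, not a real schema.
docs = [
    {"id": "1", "name": "yen symbol - \u00a5"},
    {"id": "2", "name": "tab\x0bseparated"},   # contains a vertical tab (0x0B)
]

payload = json.dumps(docs)       # escapes control and non-ASCII characters
body = payload.encode("utf-8")   # bytes to POST as application/json

print("\\u000b" in payload)          # True: the vertical tab is escaped
print("\\u00a5" in payload)          # True: the yen sign is escaped
print(json.loads(payload) == docs)   # True: the data round-trips unchanged
```

Keep in mind that this delivers the vertical tab to the index instead of rejecting it, so cleaning the source data may still be worthwhile.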