Question

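I receive XML with the markup shown below. I use a SAX parser in Java to read the XML files and save them to a database. However, there appears to be whitespace after the p tag, as shown here.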

     <Inclusions><![CDATA[<p>                                               </p><ul> <li>Small group walking tour</li> <li>Entrance fees</li> <li>Professional guide </li> <li>Guaranteed to skip the long lines</li> <li>Headsets to hear the guide clearly</li> </ul>
                <p></p>]]></Inclusions>

But when we insert the parsed string into the database (PostgreSQL 8), it prints the following bad characters in place of that whitespace.

\011\011\011\011\011\011\011\011\011\011\011\011

Small group walking tour

Entrance fees

Professional guide

Guaranteed to skip the long lines

Headsets to hear the guide clearly

\012\011\011\011\011\011

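I would like to know why these bad characters (\011\011) are printed.
What is the best way to remove whitespace inside XML tags using Java? (Or how can I prevent those bad characters?)

I looked for examples, but most of them were in Python.

This is how my program reads the XML with SAX:

Method 1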

    // ResultHandler is the class used to read the XML.
    ResultHandler handler = new ResultHandler();
    // Use the default parser
    SAXParserFactory factory = SAXParserFactory.newInstance();
    // Retrieve the XML file
    FileInputStream in = new FileInputStream(new File(inputFile)); // inputFile is the path to the XML file.
    // Parse the XML input
    SAXParser saxParser = factory.newSAXParser();
    saxParser.parse(in, handler);

This is the ResultHandler class used in Method 1:

import org.apache.log4j.Logger;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// other imports

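    class ResultHandler extends DefaultHandler {

        // Assumed declarations: the original snippet uses these without declaring them.
        private static final Logger logger = Logger.getLogger(ResultHandler.class);
        private String strValue = "";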

        public void startDocument ()
        {
            logger.debug("Start document");         
        }

        public void endDocument ()
        {
            logger.debug("End document");
        }

        public void startElement(String namespaceURI, String localName, String qName, Attributes attribs)
        throws SAXException {           
            strValue = "";      
            // add logic with start of tag. 
        }

        public void characters(char[] ch, int start, int length)
        throws SAXException {
            //logger.debug("characters");
            strValue += new String(ch, start, length);
            //logger.debug("strValue-->"+strValue);
        }

        public void endElement(String namespaceURI, String localName, String qName)
        throws SAXException {           
            // add logic to end of tag. 
        }
    }

So I need to know how to set setIgnoringElementContentWhitespace(true), or something similar, on a SAX parser.

Answer 1

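You can try setting this on DocumentBuilderFactory:

setIgnoringElementContentWhitespace(true)

Because of this:

Due to reliance on the content model this setting requires the parser to be in validating mode

you also need to set

setValidating(true)

Alternatively, str = str.replaceAll("\\s+", ""); could also work, although it removes every whitespace character, including the spaces between words.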
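
To illustrate the two factory settings, here is a minimal sketch. The class name, the element names and the inline DTD are made up for the example; validation needs a DTD to know which whitespace is ignorable. Note that this targets whitespace between child elements, so it may not affect whitespace inside a CDATA section like the one in the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class IgnoreWhitespaceDemo {
    // Hypothetical document: "list" may only contain "item" elements,
    // so the tabs and newlines between items are element-content whitespace.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item*)>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n"
        + "<list>\n\t<item>a</item>\n\t<item>b</item>\n</list>";

    // Counts the child nodes of the root element, with or without the two settings.
    static int countChildren(boolean ignoreWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(ignoreWhitespace);
        factory.setIgnoringElementContentWhitespace(ignoreWhitespace);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("default settings: " + countChildren(false));
        System.out.println("validating + ignoring whitespace: " + countChildren(true));
    }
}
```

With both settings enabled, the whitespace-only text nodes between the item elements should no longer appear among the children.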

Answer 2

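I was also looking for an exact answer, but I think this will help you. These are C/Modula-3 octal escape notation; according to the linked reference:
- \011 is a horizontal tab (ASCII HT)
- \012 is a line feed (ASCII NL, newline)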
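
As a quick check of the two codes above, Java accepts the same octal escapes in character literals. This is only a small sketch; the class and method names are invented for the example.

```java
class OctalEscapes {
    // Returns a readable name for the two control characters listed above.
    static String describe(char c) {
        switch (c) {
            case '\011': return "horizontal tab (HT)";
            case '\012': return "line feed (LF)";
            default:     return "other";
        }
    }

    public static void main(String[] args) {
        // Octal 011 is decimal 9 (tab); octal 012 is decimal 10 (newline).
        System.out.println('\011' == '\t');
        System.out.println('\012' == '\n');
        System.out.println(describe('\t'));
    }
}
```

So the "bad characters" are most likely just the tabs and the newline from the XML, shown in escaped form.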
You can replace runs of whitespace with a single space like this:

str = str.replaceAll("\\s+", " ");
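
Since the question uses SAX, and setIgnoringElementContentWhitespace belongs to DocumentBuilderFactory rather than SAXParserFactory, one possible place to apply this replacement is in the handler itself. The sketch below is an assumption about how that could look, not the asker's actual code: TrimmingHandler, getLastValue and parse are hypothetical names. It buffers the text in characters() and normalizes it in endElement().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: mirrors ResultHandler but cleans the text before it is stored.
class TrimmingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private String lastValue = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        buffer.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may call this several times per element, so append instead of assigning.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Collapse tabs, newlines and runs of spaces into one space, then trim the ends.
        lastValue = buffer.toString().replaceAll("\\s+", " ").trim();
        buffer.setLength(0);
    }

    String getLastValue() {
        return lastValue;
    }

    // Convenience method for the example: parses a string and returns the last element's text.
    static String parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TrimmingHandler handler = new TrimmingHandler();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.getLastValue();
    }
}
```

The value would then be saved to the database from endElement() instead of the raw string. Text inside a CDATA section reaches characters() like any other text, so the tabs in the Inclusions element would be collapsed as well.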

Removing whitespace inside XML tags with Java
