Question

Background:

I have an XML file that I import into BeautifulSoup and parse. One node has the following content:

<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>

Note that the value contains &#xD; and &#xA; within the text. I understand these are the XML representations of carriage return and line feed.

When I import the file into BeautifulSoup, the value is converted to the following:

<DIAttribute name="ObjectDesc" value="Line1
Line2
Line3"/>

You will notice that each &#xD;&#xA; has been converted into a newline.
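This decoding is what a conforming XML parser is expected to do with character references, so it is not specific to BeautifulSoup. A minimal sketch, assuming Python 3 and using the standard library's xml.etree.ElementTree instead of the bs4/lxml setup from the question, shows the same effect:

```python
import xml.etree.ElementTree as ET

# The node from the question, with CR/LF written as character references.
SOURCE = '<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>'

node = ET.fromstring(SOURCE)
value = node.get("value")

# repr() makes the decoded control characters visible as \r and \n.
print(repr(value))
```

After parsing, the attribute holds real carriage-return and line-feed characters, and the original spelling of the references is no longer available to the program.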

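My use case requires the value to stay as it was in the original. Any idea how to make it stay that way, or how to convert it back?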

Source code:

Python (2.7.11):

from bs4 import BeautifulSoup  # version 4.4.0
s = BeautifulSoup(open('test.xml'), 'lxml-xml', from_encoding="ansi")
print s.DIAttribute

# The XML file looks like this:
'''
<?xml version="1.0" encoding="UTF-8" ?>
<DIAttribute name="ObjectDesc" value="Line1&#xD;&#xA;Line2&#xD;&#xA;Line3"/>
'''

Notepad++ reports that the source XML file's encoding is ANSI.

Things I have tried:

I searched the documentation without success.

I also tried variations of line 3 (the print statement).

Any ideas? I appreciate any comments or suggestions.

Answer 1

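For the record, here first are the libraries that do NOT handle these character references (&#xD;, &#xA;) properly: BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES), lxml.html.soupparser.unescape, xml.sax.saxutils.unescape.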

And this is what works (in Python 2.x):

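import sys
import HTMLParser  # Python 2 module name

# Accept a file name as the first argument, or read stdin if none is passed.
data = open(sys.argv[1]).read() if len(sys.argv) > 1 else sys.stdin.read()

# unescape() replaces entities and character references with the characters they stand for.
parser = HTMLParser.HTMLParser()
print parser.unescape(data)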
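For the opposite direction, turning parsed CR/LF characters back into references when writing the value out, one option is to re-escape the string at output time. The sketch below assumes Python 3; escape_crlf is a hypothetical helper name that does not appear in the original thread, and the hex spelling of the references is chosen to match the source file:

```python
from xml.sax.saxutils import escape


def escape_crlf(value):
    """Escape XML special characters and write CR/LF as hex character references.

    Hypothetical helper for illustration only.
    """
    # escape() handles &, < and > first, then applies the extra mapping.
    return escape(value, {"\r": "&#xD;", "\n": "&#xA;", '"': "&quot;"})


parsed_value = "Line1\r\nLine2\r\nLine3"  # what the parser hands back
restored = escape_crlf(parsed_value)
print('<DIAttribute name="ObjectDesc" value="%s"/>' % restored)
```

This only restores a canonical spelling. If the source file mixed forms such as &#13; and &#xD;, that distinction cannot be recovered after parsing.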
