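Creating an XML file from a parsed text file: xml.etree.ElementTree does not work

Asked: 2019-05-05 12:29:15

Tags: python xml parsing

I am trying to structure the data from a text file into an XML file, so that the parts of the text I want to tag are marked up with XML tags.

The problem: xml.etree.ElementTree does not accept the string.

The code so far: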

import xml.etree.ElementTree as ET
with open('input/application_EN.txt', 'r') as f:
    application_text=f.read()

The first thing I want to do is tag the paragraphs. The text should look like this:

<description>
    <paragraph id="1">
        blabla
    </paragraph>
    <paragraph id="2">
        blabla
    </paragraph>
    ...
</description>

So far I have written:

# splitting the text into paragraphs
list_of_paragraphs = application_text.splitlines()
# list that collects the <paragraph> elements of the non-empty paragraphs
xml_elements = []

# counter of paragraphs of the XML file
j = 0

# Create the XML file with the paragraphs
for paragraph in list_of_paragraphs:
    # Adding only the paragraphs different than ''
    if paragraph != '':
        j = j + 1
        # be careful with the space after and before the tag.
        # Adding the XML tags per paragraph
        xml_element = '<paragraph id="' + str(j) + '">' + paragraph.strip() + ' </paragraph>'
        xml_elements.append(xml_element)

# Wrap the paragraphs in the <description> root element
# (assumed step: the original post does not show how description_text is built)
description_text = '<description>' + ''.join(xml_elements) + '</description>'

# Now I pass the whole string to the XML constructor
root = ET.fromstring(description_text)

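I get this error:

not well-formed (invalid token): line 1, column 6

After some investigation, I realized that the error occurs because the text contains the symbol "&". Adding and removing "&" in several places confirms this.

The question is: why? Why is "&" not treated as text? And what can I do about it?

I know I could replace every "&", but then I would lose information, since "& Co." is a very important string. I want the text to stay as it is (content unchanged).

Any suggestions?

Thanks.

Edit: To make this easier, here is the beginning of the text I am working on (instead of opening the file, you can paste this in to check):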

application_text='''Language=English
Has all kind of kind of references. also measures.

Photovoltaic solar cells for directly converting radiant energy from the sun into electrical energy are well known. The manufacture of photovoltaic solar cells involves provision of semiconductor substrates in the form of sheets or wafers having a shallow p-n junction adjacent one surface thereof (commonly called the "front surface"). Such substrates may include an insulating anti-reflection ("AR") coating on their front surfaces, and are sometimes referred to as "solar cell wafers". The anti-reflection coating is transparent to solar radiation. In the case of silicon solar cells, the AR coating is often made of silicon nitride or an oxide of silicon or titanium. Such solar cells are manufactured and sold by E.I. duPont de Nemeurs & Co.'''

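As you can see at the end, the "& Co." is what causes the trouble.

1 Answer:

Answer 0 (score: 0)

From: & Symbol causing error in XML Code

Some characters have a special meaning in XML, and the ampersand (&) is one of them. These characters should therefore be replaced with their corresponding entity references (for example, by using string replacement). According to the XML specification, there are 5 predefined entities in XML: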

&lt;    <   less than
&gt;    >   greater than
&amp;   &   ampersand 
&apos;  '   apostrophe
&quot;  "   quotation mark
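
As a minimal sketch of that replacement, Python's standard library offers `xml.sax.saxutils.escape`, which replaces `&`, `<` and `>` with their entity references. The helper name `build_description` and the short sample text are illustrative assumptions, not part of the question. Because the parser converts the entities back when reading, the text retrieved from the tree should still contain the plain "& Co.".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_description(text):
    """Wrap each non-empty line of `text` in a <paragraph> tag (hypothetical helper)."""
    parts = []
    stripped = (line.strip() for line in text.splitlines())
    for number, paragraph in enumerate((p for p in stripped if p), start=1):
        # escape() turns & into &amp;, < into &lt; and > into &gt;
        parts.append('<paragraph id="%d">%s</paragraph>' % (number, escape(paragraph)))
    return '<description>' + ''.join(parts) + '</description>'


# Shortened stand-in for the text from the question
sample_text = 'Language=English\n\nSold by E.I. duPont de Nemeurs & Co.'

root = ET.fromstring(build_description(sample_text))
paragraph_texts = [p.text for p in root.findall('paragraph')]
print(paragraph_texts[-1])  # the parsed text holds a plain "&" again
```

The entity only exists in the serialized XML; once parsed, `.text` gives back the original characters, so no information is lost.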

Thanks to @fallenreaper for pointing me to BS (BeautifulSoup) to create the XML file.
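
Another option that stays within the standard library, shown here only as a sketch and not as the approach actually taken with BS: build the tree with `ET.Element` and `ET.SubElement` instead of concatenating strings. ElementTree then escapes special characters itself when serializing, while `.text` keeps the original content. The function name `paragraphs_to_xml` and the sample text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_xml(text):
    """Build a <description> element with one <paragraph> per non-empty line."""
    description = ET.Element('description')
    non_empty = [line.strip() for line in text.splitlines() if line.strip()]
    for number, paragraph in enumerate(non_empty, start=1):
        element = ET.SubElement(description, 'paragraph', id=str(number))
        element.text = paragraph  # stored as-is; escaping happens on output
    return description


# Shortened stand-in for the text from the question
sample_text = 'Language=English\n\nSold by E.I. duPont de Nemeurs & Co.'

description = paragraphs_to_xml(sample_text)
xml_string = ET.tostring(description, encoding='unicode')
print(xml_string)
```

With this approach there is no need to track which characters are special: the "&" is written out as `&amp;` in the XML and read back as "&" by any conforming parser.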