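How to parse an XML file containing '%$#* ^' with xml.dom.minidom?

Time: 2018-05-31 12:35:33

Tags: python xml

I have written a Python script that parses an XML file (format given below) with xml.dom.minidom. It then sends an email alert to the email IDs defined in the XML file, together with other data defined there, such as the subject, page, and so on. When the subject contains characters like '&#@%*', I get the error "xml.parsers.expat.ExpatError: not well-formed (invalid token): line 14, column 36". Please suggest how to solve this.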

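Sample script:

from xml.dom.minidom import parse
import os
import glob

# Folder containing the XML files to process
path = r'C:\Users\sachin\Desktop\xmlwatcher'

for xml_file in glob.glob(os.path.join(path, '*.xml')):
    # parse() raises ExpatError if the file is not well-formed
    xmldoc = parse(xml_file)
    # Text of the first <FromName> element
    subject = xmldoc.getElementsByTagName('FromName')[0].firstChild.data
    print(subject)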


1 answer:

Answer 0 (score: 0)

Unfortunately, xml.dom.minidom is right. Well-formed XML text must not contain a raw & character. In XML, & introduces an entity reference, so a literal ampersand has to be written as &amp;

So any strict XML parser should choke on that line, because it is illegal.
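To make the point concrete, here is a minimal sketch that feeds the same element to minidom twice, once with a raw & and once with &amp;. The sample strings and the helper name `try_parse` are made up for illustration and are not taken from the asker's file.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Hypothetical one-line documents: the first has a raw '&', the second escapes it
bad = "<Fax><FromName>Test Email & Transaction</FromName></Fax>"
good = "<Fax><FromName>Test Email &amp; Transaction</FromName></Fax>"

def try_parse(text):
    """Return the FromName text, or None if the input is not well-formed."""
    try:
        doc = parseString(text)
    except ExpatError as exc:
        print("rejected:", exc)
        return None
    return doc.getElementsByTagName("FromName")[0].firstChild.data

bad_result = try_parse(bad)    # the parser refuses the raw '&'
good_result = try_parse(good)  # the entity is decoded back to '&'
print(good_result)
```

The parser decodes &amp; back to a plain & in the returned text, so the rest of the script sees the subject exactly as intended.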

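What can be done?

The best way is to fix the error in the producer, so that you only ever process well-formed XML files. If that is not possible, you can try to repair the file by hand and replace every bare & with &amp;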
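If the files cannot be fixed at the source, the manual replacement could be automated roughly as follows. This is only a sketch under assumptions: the pattern `BARE_AMP` and the function `escape_bare_ampersands` are hypothetical names, and the regular expression simply leaves alone any & that already starts something shaped like an entity or character reference.

```python
import re
from xml.dom.minidom import parseString

# An ampersand that does NOT begin an entity (&name;) or character reference (&#38; / &#x26;)
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace every bare '&' with '&amp;', keeping existing references intact."""
    return BARE_AMP.sub("&amp;", text)

raw = """<?xml version="1.0" encoding="utf-8" ?>
<Fax>
<FromName>Test Email & Transaction from Test Branch</FromName>
</Fax>"""

fixed = escape_bare_ampersands(raw)
doc = parseString(fixed)  # now well-formed, so minidom accepts it
name = doc.getElementsByTagName("FromName")[0].firstChild.data
print(name)
```

This only addresses ampersands. A stray < in the text would still break parsing, and the pattern does not treat CDATA sections specially, so it is a stopgap rather than a substitute for a producer that escapes its output. The other characters mentioned in the question (%, $, #, *, ^) are legal in XML text and need no treatment.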

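A simpler and probably more robust approach is to use BeautifulSoup. It is well suited to parsing malformed input and can automatically find the best interpretation of a faulty input file. Here:

t = """<?xml version="1.0" encoding="utf-8" ?>
<Fax>
...
<FromName>Test Email & Transaction from Test Branch</FromName>
...
</Fax>"""

import bs4

soup = bs4.BeautifulSoup(t, 'html.parser')
print(soup.prettify())

repairs the offending & by writing it out as &amp; and prints the corrected document. Note that html.parser lowercases tag names, so FromName appears as fromname in the result.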